Preface


The purpose of this book is to explain as simply as possible
how to perform the calculations involved in the commoner
statistical methods. It is not in any sense a treatise on the 
theory of statistics, only sufficient theory being given to 
enable the student to understand the use and application of 
the methods described. 

No assumption of mathematical ability on the part of the 
reader is made; the calculations described involve the use of
arithmetic only. A worked example of each method given is 
provided and abundant exercises with answers are supplied. 

Whilst the book is chiefly addressed to students of the biological
sciences, especially Psychology, the methods described
are fundamental to statistical work and should, it is hoped, 
prove useful to anyone who has to make use of elementary 
statistical methods. 

I should like to express my sincere gratitude to Dr J. O.
Irwin, who very kindly read the whole of the manuscript and
made many extremely helpful criticisms, and to Dr J. Wishart,
who made some very valuable suggestions when the manuscript
was approaching its final form. I wish to acknowledge
also the kindness of Professor R. A. Fisher and his publishers,
Messrs Oliver and Boyd, in allowing me to print extracts from
various statistical tables given in his book Statistical Methods
for Research Workers, and in the statistical tables by Fisher
and Yates, also published by Oliver and Boyd. Full references
to these two works are made in the text.

E. G. CHAMBERS

Cambridge
August 1940

Chapter I

INTRODUCTION

1.i. Description of certain statistical terms. The
statistical methods described in this book are all concerned
with the treatment of variables. By a variable is meant a
quantity which assumes different values that may be measured
in some appropriate unit. Height, weight, test scores, readings
on a thermometer, etc. are examples of variables, or variates,
as they are sometimes called. Variables are usually denoted by
X or Y in this book. The number of times a particular value of
a variable occurs in a set of observations is called the frequency
of occurrence of that value, and a table showing the frequency
of occurrence of all the values of a variable in a set of observations
is named a frequency distribution table.

A series of observations may be represented by one value 
which is called an average, and the way in which the different
values of the variable lie about this average is described as the
scatter or dispersion of the observations. Measures of averages
and scatter are descriptive statistics, since they yield in a condensed
form a description of a whole series of observations.
This is the first function of statistical method: the other chief 
function is the examination of various hypotheses which are 
made about observational data. 

It is usually impossible to measure all the values of any 
variable, so that the data from a single experiment are only a 
sample drawn from the total population of possible observations. 
For example, if the variable is human height, then the total
population of that variable would be the height of every man,
woman and child ever on earth; it is manifestly impossible to
measure all these values and in practice we have to content
ourselves with measuring the heights of a sample of some convenient
size. The distribution of the total population can
usually be expressed in a mathematical form by using a small
number of constants or parameters. Obviously we can never
know the exact values of these parameters, since we cannot
measure the whole population, but we can make estimates of
them by measuring random samples, that is, samples drawn
purely at random from the population. These estimates are
known as statistics, and their accuracy as estimates depends on
the size of the sample and the type of distribution of the
variable.

In calculating some statistics it is essential to know the 
number of degrees of freedom available for the calculation. The 
conception of ‘degrees of freedom’ is not an easy one for the
beginner, so that whenever the term is used in this book categorical
rules for determining the number of degrees of freedom
are supplied. Consideration of the following example may give 
some idea of the meaning of the term. Suppose 100 shillings are 
to be shared amongst 10 boys. We may give as many as we like 
(up to a total of 99) to each of 9 of the boys, but we are bound to 
give the tenth boy what is left over, i.e. we have only 9 degrees 
of freedom in sharing out the shillings. If we were told further 
that 60 shillings were to be shared amongst the oldest 5 boys 
and the remainder amongst the youngest 5, we should only 
have 8 degrees of freedom for doing this, since the fifth boy in 
each group would have to have what was left over—we should 
not be free to vary his share. 

Two variables which are related together, so that a knowledge
of the values of one variable indicates likely values of the
other, are said to be associated or correlated. If the variables are
unrelated, they are said to be independent. 

Other terms, applicable to particular methods, will be described
in their appropriate places.

1.ii. Notation. A certain amount of symbolism is essential
in the description of statistical methods. Unfortunately there
is a lack of agreement amongst different authors, which is apt
to be confusing to the beginner. For this reason an attempt at
a consistent method of notation is made in this book. Being
based on first principles it is hoped that it will be readily understood
by the learner and will enable him to follow the notation
used in the standard books on statistical methods. Symbols in
general established use are taken over unchanged. Here again
there is confusion, since owing to the multiplicity of statistics
the same symbol may have to stand for quite different quantities.
Thus the symbol z is used for three different quantities and
particular care is needed on the part of the student to avoid
misconception in such cases.

Certain symbols have the same significance throughout the
whole of this book. For example, N always stands for the total
number of observations in a sample and S always signifies ‘the
sum of’. The other notation is made as unambiguous as possible.
Certain Greek letters are used as symbols: a list of these
with their pronunciation is given below:

β (beta)          π (pi)
γ (gamma)         ρ (rho)
δ (delta)         σ (sigma)
η (eta)           φ (phi)
μ (mu)            χ (chi)
ν (nu)

Σ (capital sigma) is used in some places to indicate ‘the sum
of’. As a rule S is used for summation of sample values and Σ
for derived quantities.

Care must be taken by the student to avoid confusing
suffixes and indices. Suffixes are small numbers or letters
written after a symbol at the foot, e.g. $x_1$, $\sigma_x$, etc.; these are
merely descriptive and confine the use of the symbol to a particular
purpose. Indices are small numbers written after and
above symbols and have their usual algebraical significance;
for example, $x^2$ (x squared) means x multiplied by x, $y^3$ (y cubed)
means y multiplied by y multiplied by y, and so on.

The usual arithmetical symbols, +, −, × and ÷, have their
accustomed significance. There are two other symbols with
which the non-mathematical student may not be familiar.
Vertical lines drawn on each side of a quantity mean ‘the positive
numerical value of’, e.g. | a − b | means ‘the positive numerical
value of the difference between a and b’. Using this notation,
therefore, it does not matter whether we write | a − b | or
| b − a |. Secondly, there is the factorial sign, ‘!’. This latter is
best explained by examples, e.g. 4! stands for 4 × 3 × 2 × 1,
6! for 6 × 5 × 4 × 3 × 2 × 1, and so on.
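
The two symbols just described have direct counterparts in most programming languages. The following is a minimal sketch in Python; the choice of language and the values given to a and b are assumptions made purely for illustration and are not part of the text.

```python
import math

# Arbitrary illustrative values (an assumption, not taken from the text).
a, b = 3, 8

# | a - b | and | b - a | are the same positive number.
abs_ab = abs(a - b)
abs_ba = abs(b - a)
print(abs_ab, abs_ba)

# 4! = 4 x 3 x 2 x 1 and 6! = 6 x 5 x 4 x 3 x 2 x 1.
four_factorial = math.factorial(4)
six_factorial = math.factorial(6)
print(four_factorial, six_factorial)
```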

Since a good deal of arithmetical work is involved in certain
of the statistical methods described in the following chapters,
it is an advantage for the student to be familiar with the use of
logarithms (unless he has a calculating machine available).

1.iii. References. Reference is made in the following
pages to two invaluable books on statistics and to certain books
of statistical tables. The references are made numerically to the
following works:

(1) An Introduction to the Theory of Statistics. G. Udny Yule
and M. G. Kendall. 11th edition. Charles Griffin & Co.
Ltd. 1937. 21s.

(2) Statistical Methods for Research Workers. R. A. Fisher.
6th edition. Oliver and Boyd. 1936. 16s.

(3) Tables for Statisticians and Biometricians, Part I. Edited
by Karl Pearson. Biometrika Office, University College,
London. 15s.

(4) Barlow's Tables of Squares, Cubes, Square-roots, Cube-roots
and Reciprocals of all Integral Numbers up to 10,000.
E. and F. N. Spon. 1930.

(5) Statistical Tables for Biological, Agricultural and Medical
Research. Fisher and Yates. Oliver and Boyd. 1938.
13s. 6d.

Since no attempt is made in this book to prove or justify the 
various methods and formulae used, the student wishing to go 
into such matters is referred to the first two of the foregoing 
works. 

1.iv. Use and abuse of statistical methods. The student
who works conscientiously through the following chapters
should learn how to make use of the commoner methods of
statistics. He should never forget, however, that statistical
methods are merely tools for a research worker. They enable
him to describe, relate and assess the value of his observations.
They cannot make amends for incorrect observation nor can
they of themselves provide a single fact of psychology, biology
or any other subject of research. Statistical methods are to the
research worker what tools are to a carpenter. The latter has
first to learn how to use his tools and he may then by employing
them reveal the useful and beautiful purposes to which his
material may be put. But the tools themselves must be used
for their correct functions. The craftsman will not, for instance,
use a mallet and chisel or a fretsaw to plane a plank of wood,
nor will he use a hammer to drive in a screw. In the same way
statistical methods must only be used by the research worker
for the purposes for which they have been devised.

Further, a carpenter’s tools cannot tell him directly anything 
about the materials he is using. They cannot by themselves distinguish
between mahogany and deal nor prove that oak is
more durable than white wood. No carpenter’s tools have ever 
yet made a piece of wood: similarly no statistical method has 
ever yet produced a biological fact. 

The student is advised, therefore, to try to acquire an understanding
of the specific purpose of each statistical method he
learns to use, to appreciate the scope of and the assumptions
underlying the use of each formula, and to realise that the outcome
of each calculation is a statistical statement which has to
be interpreted in terms of the particular branch of science from
which the data for examination are drawn.

Chapter II

2.i. The arithmetic mean. The best known and most useful
form of average is the arithmetic mean, usually referred to as
the ‘mean’ or the ‘average’. It is easily calculated by adding
together all the observations to be averaged and dividing the
sum or total by the number of observations.


Example 1. Find the mean of the following observations:
22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23, 25, 21, 21, 22,
24, 23, 22, 23, 21, 22, 21, 23.

Add together all the observations.

The sum = 549,
The number of observations = 25,

$$\text{The arithmetic mean} = \frac{\text{sum of observations}}{\text{no. of observations}} = \frac{549}{25} = 21.96.$$


This procedure may be generalised to cover all cases. If X
is a variable which has different values $X_1$, $X_2$, $X_3$, etc., then
the arithmetic mean of a number N of such values is the sum
of the various values of X, which we denote by S(X), divided
by N, the number of them. In general, therefore,

$$m_x = \bar{X} = \frac{S(X)}{N}. \qquad (1)$$

Here $m_x$ and $\bar{X}$ (called X-bar) are different ways of denoting
‘the mean of X’.
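
Formula (1) can be checked mechanically. The sketch below is written in Python, which is an assumption of this illustration and not a tool the text relies upon; the function name arithmetic_mean is likewise hypothetical. It applies the formula to the 25 observations of Example 1.

```python
# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]


def arithmetic_mean(values):
    """Formula (1): the sum S(X) divided by the number of observations N."""
    return sum(values) / len(values)


total = sum(observations)             # S(X)
n = len(observations)                 # N
mean = arithmetic_mean(observations)  # S(X) / N
print(total, n, round(mean, 2))
```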


2.ii. If N is large and no adding machine is available, the
process of addition may be very laborious. It may, however,
be made easier by the construction of a frequency distribution
table. This is a table showing how often each value of the
variable occurs in the sample under consideration. In Example
1, the values taken by X all lie between 19 and 25 inclusive.
If we count how many times each different value of X occurs
and write the totals in tabular form, we obtain the frequency
distribution table given below.


Table I

 X     f
19     1
20     3
21     5
22     7
23     6
24     2
25     1
      --
      25

In this table the first column, headed X, shows the different
values assumed by the variable X in the sample, and the second
column, headed f, gives the number of times, or frequency, of
occurrence of each. The total of the f column is, of course, the
total number of observations we are averaging, i.e. S(f) = N.
The next step is to write down a third column, headed fX,
which is produced by multiplying together the corresponding
pairs of numbers in the X and f columns. We then sum the
fX column, giving us Σ(fX), and the arithmetic mean is then
obtained by dividing this sum by N as before; i.e.

$$m_x = \bar{X} = \frac{\Sigma(fX)}{N}. \qquad (2)$$


Example 2. Calculate the mean of the observations in
Example 1 by constructing a frequency distribution table.

 X     f     fX
19     1     19
20     3     60
21     5    105
22     7    154
23     6    138
24     2     48
25     1     25
      --    ---
      25    549

Σ(fX) = 549,
N = 25,

$$\bar{X} = \frac{549}{25} = 21.96.$$

It will be noted that this result is identical with that obtained
in Example 1.
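
The frequency-table method of formula (2) may be sketched as follows. Python and its standard Counter class are assumptions of this illustration; the text itself presumes hand calculation. The sketch counts each value of X, forms the fX products and divides their sum by N.

```python
from collections import Counter

# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

# Frequency distribution table: each value of X mapped to its frequency f.
frequency = Counter(observations)

n = sum(frequency.values())                         # S(f) = N
sum_fx = sum(x * f for x, f in frequency.items())   # sum of the fX column
mean = sum_fx / n                                   # formula (2)

# Print the X, f and fX columns in ascending order of X.
for x in sorted(frequency):
    print(x, frequency[x], x * frequency[x])
print(n, sum_fx, round(mean, 2))
```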
2.iii. The method of Section 2.ii is useful when the range of
the X values is small, but if there are many different values of
X the method again becomes laborious. Suppose the observations
in Example 1 were the lengths of 25 sticks measured in
centimetres, each one being measured to the nearest centimetre.
Now let us suppose we had a large number of such sticks
and measured them in millimetres. The range of the measurements
might now be from 190 mm. to 252 mm., so that if we
constructed a frequency distribution table of these we should
have 63 different values of X to tabulate. This would be tedious,
and the calculation may be shortened, with some small sacrifice
of accuracy, by subdividing the range of the X’s into a convenient
number of groups. In practice, a number of groups
between 12 and 20 should be chosen, and the best unit for
grouping may be found by dividing the range first by 12 and
then by 20, and taking a convenient unit in between these two
quotients. For instance, if the range is 63, then the results of
dividing 63 first by 12 and then by 20 are 5.25 and 3.15. Hence
a convenient working unit for grouping would be 4 in this case.
This means that we should group the values of X together in
4’s, so that the first group would comprise 190, 191, 192 and
193, the second 194, 195, 196 and 197, and so on, the last group
being 250, 251, 252 and 253. We could now construct a frequency
distribution table of these groups. Such a table might
be, for example, as Table II. This gives the frequency distribution
of the lengths of 222 sticks measured in 4 mm.
groups.

Table II

   X         f
190-193      2
194-197      4
198-201      7
202-205     12
206-209     19
210-213     24
214-217     27
218-221     35
222-225     26
226-229     21
230-233     18
234-237     13
238-241      6
242-245      5
246-249      2
250-253      1
           ---
           222

Now the method of calculating the mean length of the sticks 
from such a table depends on the assumption that the average 
length of the sticks in each group is equal to the mean value of 
X for that group. For instance, there are 12 sticks in the 202-205
group and we shall assume that the average length of those
12 sticks is equal to the average of 202, 203, 204 and 205, i.e.
203.5. This is, of course, an assumption, but the larger the
number of readings in each group the nearer it becomes to 
being true. 

Care should be taken in arranging the grouping that this 
assumption should be as nearly as possible true for the end 
groups, leaving the middle ones, which usually have more 
readings in them, to look after themselves. For instance, if 
the two readings in the 190-193 group were both 190, then 
their average would be 190 instead of 191.5, which is assumed
by the grouping. In such a case it would have been better to
have started the grouping from 189, which would assume an
average for the group of 190.5.

In order to calculate the mean of the observations in a
grouped frequency distribution table, such as Table II, we
take an arbitrary origin, or starting point, and then calculate
the discrepancy between this point and the true mean. Let us
take our arbitrary origin near the middle of the range, since
this simplifies the arithmetic. It is convenient to have it at the
centre of a group so we will choose the 218-221 group, so that
our arbitrary origin will be 219.5, the centre of the group. This
group we number 0. The next group in the table, 222-225, has
an average of 223.5, which is one group unit, or working unit,
above the arbitrary origin, and we therefore number this
group 1. In a similar manner the 226-229 group is numbered 2,
and so on.
The 214-217 group averages 215.5, which is one working
unit less than the arbitrary origin, so this group is numbered
−1. Similarly the 210-213 group becomes −2, and so on.


Example 3. Calculate the mean of the data in Table II.

   X         f      x      fx
190-193      2     -7     -14
194-197      4     -6     -24
198-201      7     -5     -35
202-205     12     -4     -48
206-209     19     -3     -57
210-213     24     -2     -48
214-217     27     -1     -27
218-221     35      0    -253*
222-225     26      1      26
226-229     21      2      42
230-233     18      3      54
234-237     13      4      52
238-241      6      5      30
242-245      5      6      30
246-249      2      7      14
250-253      1      8       8
           ---           ----
           222            256
                         -253
                         ----
                            3

Σ(fx) = 3,
N = 222,

$$D = \frac{3}{222} = 0.0135,$$

w = 4,
$m_a$ = 219.5,

$$\bar{X} = 219.5 + 0.0135 \times 4 = 219.5 + 0.054 = 219.554.$$

* Since there will be no entry in the fx column corresponding to
x = 0, this is a convenient place to add the negative entries in the fx
column.


We can now replace the X column by another column, which
we will head ‘x’, which indicates the number of working units
that each X-group lies away from the arbitrary origin. The X
column is now neglected and a fourth column, headed fx, is
written down. This is obtained by multiplying corresponding
entries in the f and x columns. By adding this column we get
Σ(fx), and the discrepancy, D, between the true mean and the
arbitrary origin, in working units, is given by

$$D = \frac{\Sigma(fx)}{N}. \qquad (3)$$

This quantity D tells us how many working units the true mean
$m_x$ lies away from the arbitrary origin, which we will call $m_a$.
Hence the true mean is given by the formula

$$m_x = \bar{X} = m_a + Dw, \qquad (4)$$

where w is the size of the working unit. If D turns out to be
negative, then Dw will have to be subtracted from $m_a$. If it
should happen that the arbitrary origin is the true mean, as
might happen with a perfectly symmetrical distribution, then
D will, of course, be zero.

The whole process of calculating the mean by this method is 
shown in Example 3. 

We conclude therefore that the mean length of the sticks was 
220 mm., to the nearest millimetre. 
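
The arbitrary-origin method of formulae (3) and (4) can be sketched in a few lines. Python is an assumption of this illustration, and the variable names (origin_index, m_a and so on) are chosen here for convenience; the frequencies are those of Table II.

```python
# Table II: lower limit of each 4 mm. group and its frequency f.
lower_limits = list(range(190, 254, 4))   # 190, 194, ..., 250
frequencies = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]

w = 4               # size of the working unit
origin_index = 7    # position of the 218-221 group, which is numbered 0

# Arbitrary origin m_a: the centre of the group 218, 219, 220, 221.
m_a = lower_limits[origin_index] + (w - 1) / 2

# The x column: number of working units each group lies from the origin.
x = [i - origin_index for i in range(len(frequencies))]

n = sum(frequencies)                                    # N
sum_fx = sum(f * xi for f, xi in zip(frequencies, x))   # sum of the fx column
d = sum_fx / n                                          # formula (3)
mean = m_a + d * w                                      # formula (4)

print(n, sum_fx, round(d, 4), round(mean, 3))
```

Because the x values are small whole numbers, the products in the fx column stay small, which is the whole attraction of the method for hand calculation.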

2.iv. The median. Another form of average which is sometimes
used for convenience is the median. This, as its name implies,
is the middle observation and it is easily found in ungrouped
data by ranking the sample of observations in order
of their size and finding the central observation, if N is odd,
or the mean of the two observations in the middle, if N is even.
If N is odd, the median will be the (N + 1)/2th observation. If
N is even, it will be the mean of the N/2th and the (N/2 + 1)th
observations. For example, the median of 2, 4, 6, 8, 10, 12 is
the mean of the 3rd and 4th readings, i.e. (6 + 8)/2 = 7.

The median is representative of a set of observations in the 
sense that there are exactly as many observations greater than 
it as there are less. If the distribution is perfectly symmetrical, 
the median is equal to the mean. 

Example 4. Find the median of the observations in Example 1.

Ranking the observations in order of size we get: 19, 20, 20,
20, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23,
23, 23, 24, 24, 25.

Since there are 25 observations, the median will be the 13th,
i.e. 22.
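
The ranking rule of Section 2.iv may be written out as a short function. This is a sketch in Python (an assumption of this illustration); the function name median is chosen here, and Python's standard statistics module offers an equivalent function of the same name.

```python
def median(values):
    """Middle observation if N is odd, mean of the two middle ones if N is even."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # With zero-based indexing this is the (N + 1)/2 th observation.
        return ranked[middle]
    # Mean of the N/2 th and the (N/2 + 1)th observations.
    return (ranked[middle - 1] + ranked[middle]) / 2


# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

even_case = median([2, 4, 6, 8, 10, 12])
odd_case = median(observations)
print(even_case, odd_case)
```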


2.v. Finding the median for grouped data involves a little
approximation, as was the case in finding the mean of grouped
data. We assume that the values in each group are evenly
distributed through the group and the approximate value of
the median may be found by linear interpolation. Suppose, for
example, we wished to find the median of the data in Table II.
We have to find the value of the observation below which half
the observations lie. N in this case is 222, so that ½N is 111.
The median therefore will be the position of the mean of the
111th and 112th observations. By adding together the first
seven entries in the f column we find that 95 of the observations
lie below 217.5,* and there are 35 observations in the 218-221
group. Obviously, therefore, the median lies somewhere in this
group. The position may be found by simple proportion.
111 − 95 = 16. In the 218-221 group 16 observations will therefore
be less and 19 greater than the median, assuming that all
the values in this group are evenly distributed. The group contains
4 units, so that the position of the median is given by
adding 16/35 × 4 to 217.5, the limit of the previous group.
Hence the median of the data in Table II is

$$217.5 + \frac{16}{35} \times 4 = 217.5 + 1.83 = 219.33.$$

This may be checked by working from the other end of the
table. We find that 92 observations lie above 221.5, so that the
median is

$$221.5 - \frac{111 - 92}{35} \times 4 = 221.5 - 2.17 = 219.33.$$
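
The interpolation described in this section can be sketched as below. Python and the function name grouped_median are assumptions of this illustration. The sketch also assumes, as the text does for Table II, that measurements are made to the nearest whole unit, so that the real lower limit of a group lies half a unit below its stated lower limit.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data by linear interpolation within the median group."""
    n = sum(frequencies)
    half = n / 2
    below = 0  # number of observations in the groups already passed
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and below + f >= half:
            real_lower = lower - 0.5   # assumed: readings to the nearest unit
            return real_lower + (half - below) / f * width
        below += f
    raise ValueError("the table contains no observations")


# Table II: lower limit of each 4 mm. group and its frequency f.
lower_limits = list(range(190, 254, 4))
frequencies = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]

median_length = grouped_median(lower_limits, frequencies, 4)
print(round(median_length, 2))
```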


2.vi. The mode. A form of average which is occasionally
used is the mode. This is the most frequently occurring, or most
fashionable, observation. For instance, in the data of Example
1, the mode would be 22, since this observation occurs more
often than any other. However, the mode cannot usually be
found as easily as this for small samples, since errors of
sampling may result in the frequent occurrence of some
observation remote from the true mode.

* Note that since the measurements in the X column are made to
the nearest millimetre, any observation up to but not equal to 217.5
will count as 217 or under, so that the real upper limit of the group
is 217.5. Similarly the real lower limit of the 222-225 group is 221.5.

If a graph of the frequency distribution of any variable for
the total population were drawn, we should have a smooth
curve (see Fig. 2, p. 25 for an example). In such a curve the
position of the mode would be given by the highest point, since
there would be the maximum frequency at that point. Such
a curve might be symmetrical or it might be asymmetrical, or
skew. It has been found that in curves of moderate skewness
the position of the mode is given approximately by the relation

Mode = mean − 3 (mean − median).

It is better, therefore, to make use of this equation for finding
the mode of a sample than to rely on picking out the most
frequently occurring observation. (We are considering here
only distributions which give a hump-backed curve and have
therefore only one mode. Certain distributions may have two
or more modes, but the elementary student need not concern
himself with these.)

Example 5. Find the mode of the data in Table II. 

We have found previously that the mean of the data is 219.55
and the median is 219.33. Substituting these values in the
equation

Mode = mean − 3 (mean − median),

we have Mode = 219.55 − 3 (219.55 − 219.33)
             = 219.55 − 0.66
             = 218.89.
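
The relation between mode, mean and median is a single line of arithmetic. A minimal sketch in Python follows; the language and the function name estimated_mode are assumptions of this illustration, and the relation itself is only approximate, holding for curves of moderate skewness as stated above.

```python
def estimated_mode(mean, median):
    """Approximate mode for a moderately skew, single-humped distribution."""
    return mean - 3 * (mean - median)


# Mean and median of the data in Table II, as found in Examples 3 and 5.
mode = estimated_mode(219.55, 219.33)
print(round(mode, 2))

# When mean and median coincide, the estimate coincides with them too.
symmetrical = estimated_mode(22.0, 22.0)
print(symmetrical)
```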

2.vii. It may be seen from the above equation that if the
mean and median are equal the mode is also equal to the mean.
Thus in the case of a variable which is distributed in a perfectly
symmetrical manner, the mean, the median and the mode are
all equal.

Note on the construction of a frequency distribution table 

If the observations to be tabulated are on cards, the frequency
distribution table is easily formed by sorting the cards
into their appropriate groups and counting the number of
cards in each group. When, however, the data are not on cards,
a frequency table may be made by the use of a spot diagram.
This is made by tabulating the X column and then going
through the observations one at a time and putting a spot
against the appropriate X-group for each observation. The
data in Table II would appear as under in a spot diagram.

Spot diagram of the data in Table II

   X                                                     f
190-193   ••                                             2
194-197   ••••                                           4
198-201   ••••• ••                                       7
202-205   ••••• ••••• ••                                12
206-209   ••••• ••••• ••••• ••••                        19
210-213   ••••• ••••• ••••• ••••• ••••                  24
214-217   ••••• ••••• ••••• ••••• ••••• ••              27
218-221   ••••• ••••• ••••• ••••• ••••• ••••• •••••     35
222-225   ••••• ••••• ••••• ••••• ••••• •               26
226-229   ••••• ••••• ••••• ••••• •                     21
230-233   ••••• ••••• ••••• •••                         18
234-237   ••••• ••••• •••                               13
238-241   ••••• •                                        6
242-245   •••••                                          5
246-249   ••                                             2
250-253   •                                              1
                                                       ---
                                                       222

It helps in counting the spots for the entries in the f column if they
are put down in groups of five, as above.
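
Where the observations are held in a list and not on cards, the sorting into groups can be done by a short routine. The sketch below is in Python (an assumption of this illustration); the function name grouped_frequencies and the ten sample lengths are hypothetical and are not data from the text. It assumes whole-number observations none of which is smaller than the chosen starting point.

```python
from collections import Counter


def grouped_frequencies(values, start, width):
    """Map each occupied group (lower, upper) to the number of values in it."""
    # Each value is assigned to the lower limit of the group containing it.
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return {(lower, lower + width - 1): counts[lower] for lower in sorted(counts)}


# Hypothetical stick lengths in millimetres, invented for illustration.
lengths = [190, 193, 194, 201, 203, 203, 204, 219, 220, 252]

table = grouped_frequencies(lengths, start=190, width=4)
for (lower, upper), f in table.items():
    print(f"{lower}-{upper}", f)
```

Groups that contain no observations are simply omitted from the result; a fuller table would list them with a frequency of zero.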


EXERCISES ON CHAPTER II 

1. Find the arithmetic mean of the scores in each of the tests in 
Appendix E from A to G inclusive for 

(a) subjects 1-25, 

(b) subjects 26-50, 

(c) subjects 51-75,

(d) subjects 76-100.

Use formula (1).

2. Find the arithmetic mean of the scores in test F for 

(a) subjects 1-50, 

(b) subjects 51-100. 

Use formula (2).

3. Construct grouped frequency distribution tables for the scores
in each of the tests A to G inclusive for the whole 100 subjects. Use
the following group units and starting points.

Test    Group unit    First group
 A           4           14-17
 B          15          122-136
 C           4            0-3
 D           4            4-7
 E           6           78-83
 F           2            8, 9
 G           3           17-19

(a) Hence, using formulae (3) and (4), calculate the mean scores
in each of the tests A to G inclusive for the whole 100 subjects.

(b) Compare the means obtained with those given by adding the 
means of the four groups of 25 subjects found in Exercise 1 and 
dividing the totals by 4. These latter means are accurate and will 
indicate the discrepancies introduced by grouping the data. 

4. Find the median of the scores in each of the tests from A to G
inclusive for

(a) subjects 1-50, 

(b) subjects 51-100. 

5. From the answers to Exercise 1 compute the means of the scores
of each test from A to G inclusive for

(a) subjects 1-50,

(b) subjects 51-100.

Then, making use of the answers to Exercise 4, estimate the mode of 
the scores in each test for 

(a) subjects 1-50, 

(b) subjects 51-100. 

Use the relationship ‘Mode = mean − 3 (mean − median)’.



Chapter III 

SCATTER OR DISPERSION 

3.i. Whilst an average to some extent represents the whole 
series of observations of which it is the mean, yet it does not by 
itself convey sufficient information about those observations. 
As a general rule it is also necessary to know how the observations
are scattered around their average. Obviously the average
of a given number of observations which all lie closely about 
the mean is more reliable as a representative statistic than one 
which is in the middle of a widely dispersed series of readings. 
It is important, therefore, to give an indication of the amount 
of scatter of the observations averaged. 

3.ii. The range. The simplest measure of scatter or dispersion
is the range of the observations, i.e. the distance between
the largest and smallest observations. The range, however,
is not a good measure of dispersion. It is based on two
observations only, instead of on all the available information,
and those two observations are liable to vary considerably in
different samples, since the range depends essentially on the
size of the sample.

Some writers give the range which includes all the observations
except the 10% smallest and the 10% largest in an
attempt to allow for the variability of the extremes of a sample,
but this practice has little to recommend it.

3.iii. The inter-quartile range. A measure of dispersion
which is not quite so subject to fluctuations of sampling is
the inter-quartile range. This is obtained as follows. First rank all
the observations in order of size and find the median (see section
2.iv). This divides the sample into two equal portions.
Then find the median of each half: that of the lower half is
called the lower quartile, that of the upper half the upper
quartile. These two quartiles with the median divide the whole
range of observations into four equal inter-quartile groups,
and the distance between the lower and upper quartiles is called
the inter-quartile range. It is usual to halve this and quote the
‘semi-inter-quartile range’.

This measure of dispersion makes use of slightly more information
than does the total range, but it does not lend itself
to further mathematical treatment and is not a very valuable
measure.
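
The range and the quartiles may be illustrated on the data of Example 1. The sketch below is in Python (an assumption of this illustration) and the function name quartiles is hypothetical. The text does not say what is to be done with the median observation itself when N is odd; the sketch assumes it is left out of both halves, and other conventions would give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Lower and upper quartiles as the medians of the two halves of the sample."""
    ranked = sorted(values)
    half = len(ranked) // 2
    lower_half = ranked[:half]
    # Assumed convention: when N is odd the middle observation is excluded.
    upper_half = ranked[len(ranked) - half:]
    return median(lower_half), median(upper_half)


# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

sample_range = max(observations) - min(observations)
lower_q, upper_q = quartiles(observations)
inter_quartile_range = upper_q - lower_q
semi_inter_quartile_range = inter_quartile_range / 2

print(sample_range, lower_q, upper_q, semi_inter_quartile_range)
```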


3.iv. The mean deviation. A measure of scatter which
makes use of all the observations is obtained by writing down
the difference between each separate observation and the
average, adding together all these differences without regard
to their signs, and dividing the total by the number of observations.
This is called the mean deviation or mean variation and is
best calculated from the median, as it is then a minimum.

Example 6. Calculate the mean deviation of the data in
Example 1.

The median of the observations is 22, as was found in
Example 4. The differences between each reading and the
median, without regard to sign, are: 0, 2, 2, 1, 1, 3, 1, 0, 2, 0, 2,
0, 1, 3, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1.

The sum of these differences = 27,
N = 25,

$$\text{The mean deviation} = \frac{27}{25} = 1.08.$$
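
Example 6 may be reproduced as follows. The sketch is in Python (an assumption of this illustration) and the function name mean_deviation is hypothetical; the deviations are taken from the median, as the text recommends.

```python
from statistics import median


def mean_deviation(values, centre=None):
    """Mean of the deviations from the centre, taken without regard to sign."""
    if centre is None:
        centre = median(values)   # the median makes the mean deviation a minimum
    return sum(abs(v - centre) for v in values) / len(values)


# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

sum_of_differences = sum(abs(v - 22) for v in observations)
deviation = mean_deviation(observations)
print(sum_of_differences, round(deviation, 2))
```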


3.v. The standard deviation. By far the best and most
useful measure of scatter is the standard deviation. In words,
this is the square root of the mean of the squares of the deviations
of the observations from their arithmetic mean. In
symbols, if $(X - \bar{X})$ represents the deviation of an individual
reading from the mean and $S(X - \bar{X})^2$ the sum of the squares of
all such deviations, then the standard deviation, σ, is given by
the formula

$$\sigma = \sqrt{\frac{S(X - \bar{X})^2}{N}}. \qquad (5)$$

The square of the standard deviation is called the variance
or second moment, the latter usually being denoted by $\mu_2$.
Hence the variance

$$\sigma^2 = \mu_2 = \frac{S(X - \bar{X})^2}{N}. \qquad (5A)$$

The method of calculating the standard deviation depends
on the type of data and the number of observations, and the
following sections show how to calculate it in different cases.


3.vi. When N, the number of observations, is small and the
mean of the observations is a whole number, the standard
deviation is easily calculated by subtracting the mean from
each observation, squaring each of these differences, adding the
squares, dividing the sum of the squares by N and finally
taking the square root of the quotient.

Example 7. Find the standard deviation of the first 11
natural numbers.

    X     $X - \bar{X}$     $(X - \bar{X})^2$
    1         -5                 25
    2         -4                 16
    3         -3                  9
    4         -2                  4
    5         -1                  1
    6          0                  0
    7          1                  1
    8          2                  4
    9          3                  9
   10          4                 16
   11          5                 25
   --                           ---
   66                           110

$S(X) = 66$,

$N = 11$,

$\bar{X} = \frac{66}{11} = 6$,

$S(X - \bar{X})^2 = 110$.

$$\sigma = \sqrt{\frac{110}{11}} = \sqrt{10} = 3.16.$$
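As a check on Example 7, formula (5) can be written out directly in Python. The sketch below assumes nothing beyond the formula itself; the name `standard_deviation` is simply a label chosen here.

```python
from math import sqrt


def standard_deviation(values):
    """Formula (5): square root of the mean squared deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)


first_eleven = list(range(1, 12))        # 1, 2, ..., 11
sigma = standard_deviation(first_eleven)
print(round(sigma, 2))                   # 3.16
```

Python's standard library function `statistics.pstdev` uses the same divisor N, so it should give the same figure.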


3.vii. It rarely happens in practice, however, that the mean
of a set of observations is a whole number or that N is as small
as 11. If the mean is fractional, it would be laborious to square
all the fractional differences between the readings and their
mean. It is possible, however, to avoid this. If the different
values of the variable are $X_1$, $X_2$, $X_3$, etc., then their differences
from the mean are $(X_1 - \bar{X})$, $(X_2 - \bar{X})$, $(X_3 - \bar{X})$, etc. Squaring
these we get

$$(X_1 - \bar{X})^2 = X_1^2 - 2X_1\bar{X} + \bar{X}^2,$$

$$(X_2 - \bar{X})^2 = X_2^2 - 2X_2\bar{X} + \bar{X}^2,$$

$$(X_3 - \bar{X})^2 = X_3^2 - 2X_3\bar{X} + \bar{X}^2, \text{ etc.}$$

Summing these we have

$$S(X - \bar{X})^2 = S(X^2) - 2\bar{X}\,S(X) + N\bar{X}^2.$$

Divide both sides by N. Then

$$\frac{S(X - \bar{X})^2}{N} = \frac{S(X^2)}{N} - 2\bar{X}\,\frac{S(X)}{N} + \bar{X}^2.$$

But $\frac{S(X)}{N} = \bar{X}$, by formula (1); hence

$$\frac{S(X - \bar{X})^2}{N} = \frac{S(X^2)}{N} - \left[\frac{S(X)}{N}\right]^2. \tag{5 B}$$

In words, the variance is the mean of the squares of the 
observations less the square of their mean. Hence the standard 
deviation of the numbers in Example 7 could have been 
calculated as under: 


    X     $X^2$
    1       1
    2       4
    3       9
    4      16
    5      25
    6      36
    7      49
    8      64
    9      81
   10     100
   11     121
   --     ---
   66     506

$S(X) = 66$,

$N = 11$,

$\bar{X} = 6$.

$$\text{Variance} = \frac{506}{11} - 6^2 = 46 - 36 = 10.$$

Hence $\sigma = \sqrt{10} = 3.16$.
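The agreement between formula (5 A) and the shortened formula (5 B) can be checked numerically. In this sketch the function names and the list `fractional_mean_data` are illustrative choices, not taken from the text; the second data set is that of Example 7.

```python
def variance_direct(values):
    """Formula (5 A): mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_shortcut(values):
    """Formula (5 B): mean of the squares less the square of the mean."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) / n) ** 2


fractional_mean_data = [1, 2, 4, 7]          # hypothetical; mean is 3.5
example_7 = list(range(1, 12))               # 506/11 - 6**2 = 10

print(variance_direct(fractional_mean_data), variance_shortcut(fractional_mean_data))
print(variance_direct(example_7), variance_shortcut(example_7))
```

With the shortcut no fractional differences ever have to be squared, which is the saving the section describes. In floating-point arithmetic the two results may differ in the last decimal places.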



The method is more usefully extended to a frequency
distribution table, for which we have the following formula:

$$\sigma^2 = \frac{\Sigma(fX^2)}{N} - \bar{X}^2. \tag{6}$$

This may also be written

$$\sigma^2 = \frac{\Sigma(fX^2)}{N} - \left[\frac{\Sigma(fX)}{N}\right]^2. \tag{6 A}$$

The use of this formula is illustrated in the following
example.
Example 8. Find the standard deviation of the data in the
following frequency distribution table:

    X     1     2     3     4     5     6
    f     2     6    12     7     2     1

Construct two further columns, as under:

    X      f     fX    $fX^2$
    1      2      2       2
    2      6     12      24
    3     12     36     108
    4      7     28     112
    5      2     10      50
    6      1      6      36
          --     --     ---
          30     94     332

$N = 30$,

$\Sigma(fX) = 94$,

$\Sigma(fX^2) = 332$,

$$\sigma^2 = \frac{332}{30} - \left(\frac{94}{30}\right)^2 = 11.0667 - 9.8178 = 1.2489.$$

$$\sigma = \sqrt{1.2489} = 1.117.$$
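The column totals of Example 8 and formula (6 A) can be reproduced as follows. This is a minimal sketch; the function name `sd_from_frequency_table` and the representation of the table as `(X, f)` pairs are choices made here for illustration.

```python
from math import sqrt


def sd_from_frequency_table(table):
    """Apply formula (6 A) to an iterable of (X, f) pairs.

    Returns N, the sum of fX, the sum of fX^2 and the standard deviation.
    """
    n = sum(f for _, f in table)
    sum_fx = sum(f * x for x, f in table)
    sum_fx2 = sum(f * x * x for x, f in table)
    variance = sum_fx2 / n - (sum_fx / n) ** 2
    return n, sum_fx, sum_fx2, sqrt(variance)


example_8 = [(1, 2), (2, 6), (3, 12), (4, 7), (5, 2), (6, 1)]
n, sum_fx, sum_fx2, sigma = sd_from_frequency_table(example_8)
print(n, sum_fx, sum_fx2)     # 30 94 332
print(round(sigma, 4))        # 1.1175
```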


3.viii. This method becomes laborious if the figures in the
X column are large, since $X^2$ will be very much larger and the
$f(X^2)$ column will require a good deal of calculation and addition.
It is usually easiest, especially if N is large and the range
of the variable wide, to work from a grouped frequency distribution
table. The formulae for the standard deviation in this
case are:

$$\sigma_w^2 = \frac{\Sigma(fx^2)}{N} - \left[\frac{\Sigma(fx)}{N}\right]^2. \tag{7}$$

Using formula (3), this may also be written

$$\sigma_w^2 = \frac{\Sigma(fx^2)}{N} - \bar{x}^2. \tag{7 A}$$

Also

$$\sigma = \sigma_w \times w. \tag{8}$$

In these formulae $\sigma_w$ is the standard deviation in working
units, $\bar{x}$ is the mean in working units and $w$ is the size of the
working unit, i.e. the class interval. The method of calculation
is an extension of that used in Example 3 and the whole process
is shown below.
Example 9. Calculate the standard deviation of the data in
Table II.

      X         f      x      fx    $fx^2$
   190-193      2     -7     -14      98
   194-197      4     -6     -24     144
   198-201      7     -5     -35     175
   202-205     12     -4     -48     192
   206-209     19     -3     -57     171
   210-213     24     -2     -48      96
   214-217     27     -1     -27      27
   218-221     35      0       0       0
   222-225     26      1      26      26
   226-229     21      2      42      84
   230-233     18      3      54     162
   234-237     13      4      52     208
   238-241      6      5      30     150
   242-245      5      6      30     180
   246-249      2      7      14      98
   250-253      1      8       8      64
              ---            ----    ----
              222            +256    1875
                             -253
                             ----
                                3

The $fx^2$ column is obtained by multiplying together the
corresponding entries in the x and the fx columns.

$$\sigma_w^2 = \frac{1875}{222} - \left(\frac{3}{222}\right)^2 = 8.4457.$$

$$\sigma_w = \sqrt{8.4457} = 2.906.$$

$$\sigma = 2.906 \times 4 = 11.624.$$
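The working-unit method of Example 9 can be sketched in Python as below. The function name `grouped_sd` and its arguments are illustrative; the frequencies and working values are those of the table above, and the class interval of 4 is read off from the class limits (190-193, 194-197, and so on).

```python
from math import sqrt


def grouped_sd(frequencies, working_values, class_interval):
    """Formulae (7) and (8): standard deviation from a grouped table.

    working_values are the small integers x (..., -1, 0, 1, ...) assigned
    to the classes; the result in working units is scaled by the class
    interval to return to the original units.
    """
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, working_values))
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, working_values))
    sigma_w = sqrt(sum_fx2 / n - (sum_fx / n) ** 2)
    return sigma_w, sigma_w * class_interval


freqs = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]
x = list(range(-7, 9))                    # -7, -6, ..., 8
sigma_w, sigma = grouped_sd(freqs, x, 4)
print(round(sigma_w, 3), round(sigma, 2))   # 2.906 11.62
```

Because the working values are small whole numbers, the squares and products stay manageable even when the original measurements are large.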

3.ix. The coefficient of variation. If we wish to compare
the scatter of different variables about their means, it is useful
to be able to express the scatter in some form which is not
dependent on the absolute size of the variables. For instance,
mice and men may be relatively equally variable in length, but
this would not be revealed by stating the standard deviation in
inches of a sample of each. Comparison may, however, be usefully
made by calculating the coefficient of variation in each
case. This is readily obtained by expressing the standard deviation
as a percentage ratio of the mean, i.e.

$$V = \frac{100\sigma}{m}, \tag{9}$$

where m is the mean of the sample.

The coefficient V is a ratio and is therefore independent of
the units in which the mean and standard deviation are
measured.

Example 10. Compare the relative variability of the data in
Examples 7 and 9.

In Example 7, $\bar{X} = 6$,

$\sigma = 3.16$.

$$V = \frac{100 \times 3.16}{6} = 52.67.$$

In Example 9, $\bar{X} = 219.55$,

$\sigma = 11.624$.

$$V = \frac{100 \times 11.624}{219.55} = 5.29.$$

Hence the data in Example 7 are about 10 times as scattered
relatively to their mean as are those in Example 9.
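Formula (9) is a one-line calculation; the sketch below simply repeats Example 10 with the figures quoted there. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sigma, mean):
    """Formula (9): standard deviation as a percentage of the mean."""
    return 100.0 * sigma / mean


v_example_7 = coefficient_of_variation(3.16, 6)
v_example_9 = coefficient_of_variation(11.624, 219.55)
print(round(v_example_7, 2), round(v_example_9, 2))   # 52.67 5.29
print(round(v_example_7 / v_example_9, 1))            # roughly 10
```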


EXERCISES ON CHAPTER III

6. Find the upper and lower quartiles of the scores in each test
from A to G inclusive for the whole 100 subjects and thence derive the
semi-inter-quartile range of each test.

7. Calculate the mean deviation of the scores in each of the tests
from A to G, using the medians given in the answers to Exercise 4, for

(a) subjects 1-50,

(b) subjects 51-100.

8. Using formula (5), calculate the standard deviation of the
scores in

(a) test E, subjects 1-25,

(b) test D, subjects 76-100.
9. Calculate the standard deviation of the following test scores,
using formula (6); tests A, C, D, F and G:

(a) subjects 1-25,

(b) subjects 26-50,

(c) subjects 51-75,

(d) subjects 76-100.

10. Using the grouped frequency distribution tables constructed in
Exercise 3, calculate the standard deviation of the scores of each test
from A to G for the whole 100 subjects. Use formulae (7 A) and (8).

From these standard deviations and the means given in the answers
to Exercise 3 (a), calculate the coefficient of variation of each test
score. Use formula (9).



Chapter IV 

THE NORMAL DISTRIBUTION 

4.i. Many of the methods of statistics depend on the 
assumption that the variables under consideration are normally 
distributed and most of the methods cannot strictly be applied 
unless this assumption is justified. Biological variables are 
often in point of fact so distributed but it is not safe to assume 
that any particular distribution is normal without examination. 



Fig. 1. Histogram of frequency distribution. 

The meaning of a normal distribution may be most easily
understood by considering certain graphs. Suppose we construct
a graph of the data in Table II. Along the base line we
first mark off points at equal intervals to represent the X-groups.
On the left we make a vertical scale to indicate frequencies.
At each point marking an X-group we then draw
vertical lines to represent frequencies: the requisite length of
these lines is indicated by the frequency of a particular X-group
and the frequency scale on the left. Finally we join these
vertical lines together by horizontal lines and we have what is
called a histogram of the frequency distribution of Table II.
This is illustrated in Fig. 1.

This figure gives a picture of the distribution of the variable
X and it shows that most of the observations are clustered
about the middle of the range and that there are relatively few
observations at the extremes. The total frequency of the
observations is given by the area between the base-line, the top
of the histogram and the vertical lines at the boundaries if the
interval between the groups is 1 unit. It will be seen that the
top of the histogram is rather irregular. If, however, a very
large number of observations had been made and the points
along the base-line had been much more numerous, then the
boundary of the histogram would have become a smooth,
continuous line, in shape something like the curve shown in Fig. 2.

Fig. 2. The normal distribution.

Fig. 2 shows the shape of a normal distribution, sometimes 
called a Gaussian distribution. This curve has various im¬ 
portant properties, some of which must be mentioned. 

4.ii. The normal distribution has an exact mathematical
formula the nature and significance of which are beyond the
scope of this book. It is a continuous curve and applies to
continuous variables, such as height, where the difference between
one value of the variable and the next can be indefinitely small.

Mathematically the curve stretches to infinity in both
directions, but practically only the portion drawn above is of
importance.

The mean of the observations is in the exact centre of the 
curve and there is the greatest number of observations at this 
point. Since the curve is symmetrical, the median and the 
mode coincide with the mean. 

The area between the curve and two uprights drawn at any 
points gives the fraction of the total number of observations 
between those points. In Fig. 3 uprights have been drawn at 
points corresponding to 1, 2 and 3 times the standard deviation 
of the distribution on each side of the mean. 



The area between $-\sigma$ and $+\sigma$ is 68% of the total area. This
means that in a normal distribution 68% of the observations
lie within a distance equal to the standard deviation on each
side of the mean. Similarly, from $-2\sigma$ to $2\sigma$ includes 95%, and
from $-3\sigma$ to $3\sigma$ includes 99.7% of the observations. Hence it
is obvious that in a normal distribution practically all the
observations lie within a range of 6 times the standard deviation.
This provides a rough check on the size of a calculated
standard deviation: if the number of observations is large, the
standard deviation should be approximately a sixth of the
range. (Note that in Example 9 the standard deviation was
11.6 and the range was 63, nearly 6 times as great.)

Exactly half the observations are included in an area
bounded by a distance of $0.67449\sigma$ on each side of the mean,
shown by dotted lines in Fig. 3. This means that the chances
are exactly equal that a single observation shall deviate from
the mean by an amount greater or less than $0.67449\sigma$. Use is
made of this fact in calculating probable errors, which will be
explained later.

Expressing the above measurements in a different way we
may say that a value of X whose deviation from the mean,
either positively or negatively, is greater than $\sigma$ will occur
roughly 1 in 3 times; a positive or negative deviation greater
than $2\sigma$ will occur about 1 in 20 times, and greater than $3\sigma$
about 1 in 370 times. Tables have been calculated showing the
probability of obtaining deviations of any size. Such a table
is called a Table of Probability Integral and may be found in
Tables for Statisticians and Biometricians (Ref. 3).
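The areas quoted above can be recomputed without printed tables. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name `fraction_within` is an illustrative choice and is not drawn from the text.

```python
from statistics import NormalDist

unit_normal = NormalDist()        # mean 0, standard deviation 1


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations on each side of the mean."""
    return unit_normal.cdf(k) - unit_normal.cdf(-k)


for k in (1, 2, 3):
    inside = fraction_within(k)
    # percentage inside, and "1 in how many" observations fall outside
    print(k, round(100 * inside, 1), round(1 / (1 - inside)))
# 1 68.3 3
# 2 95.4 22
# 3 99.7 370

# Deviation that encloses exactly half the observations.
half_point = unit_normal.inv_cdf(0.75)
print(round(half_point, 5))       # 0.67449
```

The figure of "about 1 in 20" for twice the standard deviation is a convenient rounding; the deviation that is exceeded exactly 1 time in 20 is 1.96 times the standard deviation.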

On such facts is based the conception of statistical significance.
The term ‘significance’ is used in statistics to indicate
that the odds are heavy against the deviation from its expected
value of a particular estimate, difference or coefficient occurring
by chance as a result of random sampling. In practice
odds of 19 to 1 against an occurrence by chance are usually
taken as indicating the significance of that occurrence. This
corresponds roughly to the odds of getting a deviation from
the mean of a normal distribution greater than twice the
standard deviation, either positively or negatively. (Some
statisticians prefer heavier odds, such as 99 to 1, as their
criterion of significance. This must to some extent depend on
the nature of the variables, but in general a probability of 19
to 1 against an occurrence is usually regarded as sufficient for
significance.)

4.iii. Testing the normality of a distribution. It is unsafe
to regard any bell-shaped distribution as being necessarily
a normal distribution, and since so much of statistical method
depends on normality it is important to be able to test any
given distribution for normality. This may be done quite
simply by mathematical means, although the process requires
a good deal of arithmetic.



Essentially, testing the normality of a distribution depends
on the calculation of two constants, $\beta_1$ and $\beta_2$, which are
derived from the first four moments about the mean of the
distribution, and two further quantities, $\gamma_1$ and $\gamma_2$, which are related
to $\beta_1$ and $\beta_2$ according to the following equations:

$$\gamma_1 = \pm\sqrt{\beta_1},$$

$$\gamma_2 = \beta_2 - 3.$$

$\gamma_1$ is a measure of whether or not the distribution is symmetrical.
$\gamma_2$ measures departures of a symmetrical nature from
normality. The use of these constants will be explained later.

Now the first four moments about the mean of a frequency
distribution are denoted by $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$, and these are
calculated, with certain theoretical corrections, by a method
which is simply an extension of that already used in calculating
the standard deviation, as in Example 9. The student should
now be familiar with the method of calculating $\Sigma(fx)$ and
$\Sigma(fx^2)$. Two further columns have to be constructed, the totals
of which will yield $\Sigma(fx^3)$ and $\Sigma(fx^4)$. If these four totals are
divided by N we get four quantities denoted by $\nu'_1$, $\nu'_2$, $\nu'_3$ and
$\nu'_4$. The moments about the mean are then obtained from the
equations:

$$\mu_1 = 0,$$

$$\mu_2 = \nu'_2 - \nu'^2_1,$$

$$\mu_3 = \nu'_3 - 3\nu'_1\nu'_2 + 2\nu'^3_1,$$

$$\mu_4 = \nu'_4 - 4\nu'_1\nu'_3 + 6\nu'^2_1\nu'_2 - 3\nu'^4_1.$$

For theoretical reasons beyond the scope of this book, when
the variate is continuous certain corrections have to be applied
for grouping, and the equations then become:

$$\mu_1 = 0,$$

$$\mu_2 = \nu'_2 - \nu'^2_1 - \tfrac{1}{12},$$

$$\mu_3 = \nu'_3 - 3\nu'_1\nu'_2 + 2\nu'^3_1,$$

$$\mu_4 = \nu'_4 - 4\nu'_1\nu'_3 + 6\nu'^2_1\nu'_2 - 3\nu'^4_1 - \tfrac{1}{2}\mu_2 - \tfrac{1}{80}.$$

Having obtained these four moments, $\beta_1$ and $\beta_2$ are given by
the formulae

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}.$$

From these $\gamma_1$ and $\gamma_2$ are readily calculated. The standard
errors* of $\gamma_1$ and $\gamma_2$ are $\sqrt{6/N}$ and $\sqrt{24/N}$ respectively. This
means that if the values of $\gamma_1$ and $\gamma_2$ are less than twice these
standard errors, then the distribution is not significantly different
from the normal form: if they are greater than twice
their standard errors, the distribution is not normal.

This mass of symbolism will probably be alarming to the 
elementary student, but the actual process of calculation in¬ 
volves arithmetic only and is shown in the following example. 

Example 11. Test the normality of the distribution given in
the first two columns of the table below:

     f      x      fx    $fx^2$   $fx^3$   $fx^4$
     1     -5      -5      25     -125      625
     2     -4      -8      32     -128      512
     5     -3     -15      45     -135      405
    10     -2     -20      40      -80      160
    20     -1     -20      20      -20       20
    50      0       0       0        0        0
    22      1      22      22       22       22
    11      2      22      44       88      176
     5      3      15      45      135      405
     3      4      12      48      192      768
     1      5       5      25      125      625
   ---            ----    ---     ----     ----
   130             +76    346     +562     3718
                   -68            -488
                  ----            ----
                     8              74

$\nu'_1 = 8/130 = 0.0615$,

$\nu'_2 = 346/130 = 2.6615$,

$\nu'_3 = 74/130 = 0.5692$,

$\nu'_4 = 3718/130 = 28.6000$.

Hence

$$\mu_2 = 2.6615 - 0.0615^2 - 0.0833 = 2.5744,$$

$$\mu_3 = 0.5692 - (3 \times 2.6615 \times 0.0615) + (2 \times 0.0615^3) = 0.0787,$$

$$\mu_4 = 28.6 - (4 \times 0.0615 \times 0.5692) + (6 \times 0.0615^2 \times 2.6615) - (3 \times 0.0615^4) - \tfrac{1}{2}(2.5744) - 0.0125 = 27.2207.$$

$$\beta_1 = \frac{0.0787^2}{2.5744^3} = 0.000363,$$

$$\beta_2 = \frac{27.2207}{2.5744^2} = 4.1072;$$

$\gamma_1 = 0.0191$ ($\gamma_1$ has the same sign as $\mu_3$), with standard error $\sqrt{6/130} = 0.215$,

$\gamma_2 = 1.1072$, with standard error $\sqrt{24/130} = 0.430$.

It will be seen that $\gamma_1$ is considerably less than twice its
standard error, hence the distribution is symmetrical. $\gamma_2$, however,
is more than twice its standard error, so that the distribution
departs from normality in a symmetrical manner. $\gamma_2$ is
said to measure kurtosis. Curves which are flat-topped and
short-tailed compared with the normal curve are called platykurtic:
for these $\beta_2$ is less than 3. Curves which are sharply
peaked and long-tailed, and for which $\beta_2$ is greater than 3, are
called leptokurtic. In the above example, $\beta_2$ is greater than 3,
so that the distribution is leptokurtic and not normal. The
student should draw a histogram of this distribution and note
the peaked shape of it.

* The meaning of the term ‘standard error’ is explained in
Chapter V.
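The arithmetic of Example 11 can be followed step by step in Python. The sketch below applies the grouping-corrected moment equations of Section 4.iii; the function name `normality_constants` and the dictionary keys are illustrative choices. Because the program carries full precision where the worked example rounds intermediate values to four decimals, the last digit of some results may differ slightly from the printed figures.

```python
from math import sqrt


def normality_constants(frequencies, working_values):
    """gamma_1 and gamma_2 with their standard errors, from a grouped
    frequency table, using the moment equations corrected for grouping."""
    n = sum(frequencies)
    pairs = list(zip(frequencies, working_values))
    # nu'_1 ... nu'_4: column totals of fx, fx^2, fx^3, fx^4 divided by N
    v1, v2, v3, v4 = (sum(f * x ** k for f, x in pairs) / n for k in (1, 2, 3, 4))
    mu2 = v2 - v1 ** 2 - 1 / 12
    mu3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
    mu4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4 - mu2 / 2 - 1 / 80
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    gamma1 = sqrt(beta1) if mu3 >= 0 else -sqrt(beta1)   # sign follows mu_3
    return {
        "gamma1": gamma1, "se_gamma1": sqrt(6 / n),
        "gamma2": beta2 - 3, "se_gamma2": sqrt(24 / n),
    }


f = [1, 2, 5, 10, 20, 50, 22, 11, 5, 3, 1]
x = list(range(-5, 6))
result = normality_constants(f, x)
symmetrical = abs(result["gamma1"]) < 2 * result["se_gamma1"]
normal_kurtosis = abs(result["gamma2"]) < 2 * result["se_gamma2"]
print(result, symmetrical, normal_kurtosis)
```

For this table `symmetrical` comes out true and `normal_kurtosis` false, which is the conclusion reached in the example.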


4.iv. The above is not the only method of testing the
normality of a distribution. An alternative method is to fit a
normal curve to a frequency distribution and test the goodness
of fit by the $\chi^2$ method. (See Section 9.iv.)


EXERCISES ON CHAPTER IV

11. Using the method of Section 4.iii (Example 11), test the
normality of distribution of the scores in Test A for the whole
100 subjects. Make use of the grouped frequency distribution table
constructed in Exercise 3.

12. Repeat Exercise 11 for the whole 100 scores in test G.



Chapter V 

SIGNIFICANCE OF MEAN AND DIFFERENCE
BETWEEN MEANS


5.i. The observations recorded in a single biological experiment
are but one sample drawn from the whole population of
possible samples. If a second experiment is made it is unlikely
that the mean of the observations in this case will be identical
with that of the first experiment. In short, it will be found that
a large number of experiments will yield many different values
of the mean, each one departing more or less from the true
mean of the whole population.

If the standard deviation of the whole population is $\sigma_p$ and
we take a large number of random samples of n observations,
then the means of the samples will be distributed with a standard
deviation $\sigma_p/\sqrt{n}$. If the population is normally distributed,
the means also will be normally distributed. Even if the distribution
of the population is not normal, the distribution of the
means of samples still tends to be normal if the size of the
samples is sufficiently large, but in the case of small samples
the distribution of the means is not normal.

Usually we do not know the standard deviation of the whole
population but have to take the standard deviation of an observed
sample as an estimate of it. In this case we estimate the
standard deviation of the sampling distribution from the number
and standard deviation of a single sample. This estimated
value is called the standard error of the mean, i.e.

$$\text{S.e. of mean} = \frac{\sigma}{\sqrt{N}}, \tag{10}$$

where $\sigma$ is the standard deviation of the sample and N the
number of observations in it.

(There was formerly a practice, which has little to recommend
it, of calculating the probable error of the mean. This is
given by the formula

$$\text{P.e. of mean} = \frac{0.67449\,\sigma}{\sqrt{N}}. \tag{10 A}$$

It may be noted that three times the probable error is roughly
equal to twice the standard error.)

5.ii. Significance of a single mean. If we have only a
single sample to give estimates of $\bar{X}$ and $\sigma$, then the distribution
of $\bar{X}/\sigma$ will not be normal. However, the correct distribution has
been calculated and tables have been made enabling us to
make use of the data from a single sample; an extract from these
tables will be given later.

When calculating the standard deviation for the purpose of
examining the significance of the mean, $S(X - \bar{X})^2$ should be
divided by $(N - 1)$ instead of by N. (The reason for this cannot
be given here except to state that $N - 1$ represents the number
of degrees of freedom available for calculating $\sigma$, and this
division gives a better estimate of the standard deviation of
the whole population.)

Having obtained the mean and standard deviation we need
to calculate a statistic known as t; this is essentially the ratio
of the mean to its standard error. (t may also be the ratio of a
difference between means to its standard error: see below,
Section 5.iii (b).)

For a single mean

$$t = \bar{X} \div \frac{\sigma}{\sqrt{N}} = \frac{\bar{X}\sqrt{N}}{\sigma}. \tag{11}$$


Table III. Values of t corresponding to
a probability P = 0.05

    n       t         n       t         n       t
    1    12.706      11    2.201       21    2.080
    2     4.303      12    2.179       22    2.074
    3     3.182      13    2.160       23    2.069
    4     2.776      14    2.145       24    2.064
    5     2.571      15    2.131       25    2.060
    6     2.447      16    2.120       26    2.056
    7     2.365      17    2.110       27    2.052
    8     2.306      18    2.101       28    2.048
    9     2.262      19    2.093       29    2.045
   10     2.228      20    2.086       30    2.042

For n = ∞, t = 1.96.

The above table is an extract from a full table given by R. A. Fisher
(Ref. 2), Table IV; or in the Fisher and Yates tables, Table III (Ref. 5).

In Table III are given values of t corresponding to different
values of n, the number of degrees of freedom, i.e. n = N - 1 in
this case. The odds against values of t as big as or bigger than
these occurring by chance are 19:1, i.e. the probability, usually
denoted by P, of their occurring by chance is 0.05. If the
calculated value of t is greater than that given in the table
for the appropriate value of n, then the mean is significantly
different from zero.

Example 12. Ten schoolchildren were given an arithmetic
test. They were then given a month’s further tuition and a
second test of equal difficulty was held at the end of it. Their
marks in these two tests are given below.

   Scholar    1    2    3    4    5    6    7    8    9   10
   Test 1    20   18   19   22   17   20   19   16   21   19
   Test 2    22   19   17   18   21   23   19   20   22   20

Do these marks give evidence that the scholars had benefited
by the extra tuition?

This problem resolves itself into the question, Is the mean
of the differences between successive marks significantly different
from zero?

We need first to construct a third column giving the values
of (Test 2 - Test 1); this will be our X column. Add this column
to get $S(X)$ and obtain $\bar{X}$ by dividing $S(X)$ by N (formula (1)).
Make a fourth column giving $(X - \bar{X})$ and a fifth giving
$(X - \bar{X})^2$. Add this fifth column to obtain $S(X - \bar{X})^2$. This
gives us all the necessary data for calculating t. The last
three columns and the remainder of the working are shown
below.

   Test 2 - Test 1
         X          $X - \bar{X}$     $(X - \bar{X})^2$
         2               1                  1
         1               0                  0
        -2              -3                  9
        -4              -5                 25
         4               3                  9
         3               2                  4
         0              -1                  1
         4               3                  9
         1               0                  0
         1               0                  0
        --                                 --
        10                                 58

$S(X) = 10$,

$\bar{X} = 1$,

$$\frac{S(X - \bar{X})^2}{N - 1} = \frac{58}{9}, \qquad \therefore\ \sigma = \sqrt{\frac{58}{9}}.$$

Hence, by formula (11),

$$t = \frac{1 \times \sqrt{10}}{\sqrt{58/9}} = \sqrt{\frac{90}{58}} = 1.25.$$

Reference to Table III shows that for n = 9, t = 2.262. Our
calculated value is less than this, hence the mean of X is not
significantly different from zero, and the marks are insufficient
to prove the benefit of the extra tuition.
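The calculation of Example 12 can be sketched as follows. The function name `t_single_mean` is an illustrative choice; the marks are those tabulated in the example, and the critical value 2.262 is the Table III entry for n = 9.

```python
from math import sqrt


def t_single_mean(values):
    """Formula (11): ratio of a mean to its standard error, with the
    standard deviation calculated using the divisor N - 1.

    Returns t and the number of degrees of freedom, N - 1.
    """
    n = len(values)
    mean = sum(values) / n
    sigma = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean * sqrt(n) / sigma, n - 1


test_1 = [20, 18, 19, 22, 17, 20, 19, 16, 21, 19]
test_2 = [22, 19, 17, 18, 21, 23, 19, 20, 22, 20]
differences = [second - first for first, second in zip(test_1, test_2)]

t, degrees_of_freedom = t_single_mean(differences)
significant = t > 2.262                  # Table III, n = 9
print(round(t, 2), degrees_of_freedom, significant)   # 1.25 9 False
```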

5.iii. Significance of the difference between means.

An important and often occurring problem is to determine
whether there is a real difference between two observational
means or not. In statistical language this problem may be
expressed in the words, Is the difference between the means such
that they might have been drawn from the same population by
random sampling or are they drawn from two different
populations? There are two methods of dealing with this
question appropriate to the cases in which the samples are
large or in which they are small.

(a) Large samples. If the numbers of observations in the two
samples are large, say, at least 50, the question may be settled
by calculating the standard error of the difference between the
means. If the means are $\bar{X}_1$ and $\bar{X}_2$, their standard deviations
$\sigma_1$ and $\sigma_2$ and the numbers in the samples $N_1$ and $N_2$
respectively, then the standard error of the difference between
the means is given by the formula

$$\text{S.e. of difference} = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}. \tag{12}$$

This formula applies only if the two variables are independent
or uncorrelated (see Chapter VI).

(As before, the probable error of the difference would be

$$\text{P.e. of difference} = 0.67449\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}. \tag{12 A}$$

)

If the difference between the two means is greater than
twice its standard error (or three times its probable error),
then the means are significantly different, i.e. it is unlikely that
they would be drawn from the same population by random
sampling, the odds against being at least 19 to 1.

Example 13. A group of boys and a group of girls were given
an intelligence test. The mean scores, standard deviations and
numbers in the groups were as follows:

               Boys    Girls
   Mean         124      121
   $\sigma$      12       10
   N             72       50

Was the mean test score of the boys significantly greater than
that of the girls?

Difference between the means = 124 - 121 = 3,

$$\text{S.e. of difference} = \sqrt{\frac{144}{72} + \frac{100}{50}} = 2.$$

Hence

$$\frac{\text{Difference}}{\text{S.e. of difference}} = \frac{3}{2} = 1.5.$$

In this experiment, therefore, the mean intelligence test score
of the boys was not significantly greater than that of the girls.
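Formula (12) and the rule of "twice the standard error" can be sketched in Python with the figures of Example 13. The function name `se_of_difference` is an illustrative choice.

```python
from math import sqrt


def se_of_difference(sigma_1, n_1, sigma_2, n_2):
    """Formula (12): standard error of the difference between the means
    of two large, independent samples."""
    return sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)


difference = 124 - 121                       # boys' mean less girls' mean
se = se_of_difference(12, 72, 10, 50)
ratio = difference / se
significant = abs(difference) > 2 * se
print(se, ratio, significant)                # 2.0 1.5 False
```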

(b) Small samples. When we wish to compare the means of
small samples of less than 50 observations, the use of formula
(12) is no longer a sufficiently strict test and we have to apply
the t test of significance. The test is essentially similar to that
in the previous section but corrections have to be made to
allow for sampling errors, which are more important in small
samples.

Suppose we have $N_1$ readings of a variable $x_1$ and $N_2$ readings
of a variable $x_2$ and we wish to see whether or not the means
$\bar{x}_1$ and $\bar{x}_2$ differ significantly from one another (or, in other
words, we wish to see what is the probability that the two
samples could be drawn from the same population). To apply
the t test in this case we need to know six quantities, $N_1$, $N_2$,
$S(x_1)$, $S(x_2)$, $S(x_1^2)$ and $S(x_2^2)$.

As usual, $\bar{x}_1 = \frac{S(x_1)}{N_1}$ and $\bar{x}_2 = \frac{S(x_2)}{N_2}$.

The variance of the combined observations, which we shall
denote by $\sigma_d^2$, is

$$\sigma_d^2 = \frac{S(x_1 - \bar{x}_1)^2 + S(x_2 - \bar{x}_2)^2}{N_1 + N_2 - 2},$$

and the standard error of the difference is

$$\text{S.e. of difference} = \sigma_d\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}.$$

(The reader should compare this expression with formula (12)
in the case where $\sigma_1 = \sigma_2$.)

t in this case is the ratio of the difference between the means
to the standard error of the difference, i.e.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_d\sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}}. \tag{13}$$

We must now express $\sigma_d$ in terms of the six quantities with
which we started. We have

$$\sigma_d^2 = \frac{S(x_1^2) - \dfrac{[S(x_1)]^2}{N_1} + S(x_2^2) - \dfrac{[S(x_2)]^2}{N_2}}{N_1 + N_2 - 2}.$$

Written in full, therefore, the standard error of the difference is

$$\text{S.e. of difference} = \sqrt{\left\{S(x_1^2) - \frac{[S(x_1)]^2}{N_1} + S(x_2^2) - \frac{[S(x_2)]^2}{N_2}\right\} \times \frac{1}{N_1 + N_2 - 2}\left(\frac{1}{N_1} + \frac{1}{N_2}\right)}.$$

The part dealing with $N_1$ and $N_2$ under the square-root
sign reduces to

$$\frac{N_1 + N_2}{N_1 \cdot N_2 \cdot (N_1 + N_2 - 2)},$$

for which we may write $1/N'$. For the convenience of
students, values of $\sqrt{1/N'}$ for $N_1$ and $N_2$ between 10 and 50
have been tabulated in Appendix A.

Using therefore the six quantities with which we started, we
may express t in the following form:

$$t = \frac{(\bar{x}_1 - \bar{x}_2)\sqrt{N'}}{\sqrt{S(x_1^2) - \dfrac{[S(x_1)]^2}{N_1} + S(x_2^2) - \dfrac{[S(x_2)]^2}{N_2}}}. \tag{13 A}$$

The use of this formula is shown in the example below.

In this case, for using Table III to find the probability of
obtaining observed values of t, the value of n is

$$n = N_1 + N_2 - 2.$$


Example 14. The span of apprehension of two small groups
of children, one from the lowest class in a school and the other
from the top class, was tested by seeing how many digits they
could repeat backwards from memory after hearing them once
repeated forwards. The numbers of digits correctly repeated in
the two cases were as follows:

   Group A    3    5    6    4    3    3    4
   Group B    5    8    9    6   12    9    7    6

Is there any real difference in the span of apprehension of the
two groups?

We have the following data:

   For group A, $N_1 = 7$,          For group B, $N_2 = 8$,
   $S(X_1) = 28$,                   $S(X_2) = 62$,
   $S(X_1^2) = 120$,                $S(X_2^2) = 516$,
   $\bar{X}_1 = 4.00$.              $\bar{X}_2 = 7.75$.

Hence

$$\bar{X}_2 - \bar{X}_1 = 7.75 - 4.00 = 3.75.$$

Now

$$\sqrt{N'} = \sqrt{\frac{7 \times 8 \times 13}{15}} = 6.97,$$

and

$$S(X_1^2) - \frac{[S(X_1)]^2}{N_1} + S(X_2^2) - \frac{[S(X_2)]^2}{N_2} = 120 - \frac{28^2}{7} + 516 - \frac{62^2}{8} = 43.5,$$

so that, by formula (13 A),

$$t = \frac{3.75 \times 6.97}{\sqrt{43.5}} = 3.96.$$

For using Table III in this case, the value of n for entry is
7 + 8 - 2 = 13. The value of t in Table III corresponding to
n = 13 is 2.160. Our calculated value of t is considerably
greater than this, hence the difference between the means is
significant and we conclude that the span of apprehension of
group B was significantly greater than that of group A.
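The small-sample test of Example 14 can be sketched in Python from the six quantities named in Section 5.iii (b). The function name `t_two_small_samples` is an illustrative choice; the data are those of groups A and B, and 2.160 is the Table III entry for n = 13.

```python
from math import sqrt


def t_two_small_samples(first, second):
    """t for the difference between the means of two small samples,
    built from N1, N2, S(x1), S(x2), S(x1^2) and S(x2^2).

    Returns t (second mean less first mean) and n = N1 + N2 - 2.
    """
    n1, n2 = len(first), len(second)
    s1, s2 = sum(first), sum(second)
    ss1 = sum(v * v for v in first)
    ss2 = sum(v * v for v in second)
    # Sum of squared deviations of both samples about their own means
    within = ss1 - s1 ** 2 / n1 + ss2 - s2 ** 2 / n2
    n_prime = n1 * n2 * (n1 + n2 - 2) / (n1 + n2)
    t = (s2 / n2 - s1 / n1) * sqrt(n_prime) / sqrt(within)
    return t, n1 + n2 - 2


group_a = [3, 5, 6, 4, 3, 3, 4]
group_b = [5, 8, 9, 6, 12, 9, 7, 6]
t, n = t_two_small_samples(group_a, group_b)
significant = abs(t) > 2.160                 # Table III, n = 13
print(round(t, 2), n, significant)           # 3.96 13 True
```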


EXERCISES ON CHAPTER V

13. Subtract the score in test D from the score in test C for each
of the first 25 subjects. Calculate the mean difference between these
scores and determine whether this mean is significantly different from
zero or not. Use formula (11).

14. Using formula (12) and the standard deviations given in the
answers to Exercise 10, calculate the standard error of the difference
between the mean scores of the whole 100 subjects in the following
tests:

(a) A and C, (b) C and D, (c) C and F,

(d) D and F, (e) D and G, (f) F and G.

Thence determine which pairs of means are significantly different.
Use the means given in the answer to Exercise 3 (a).

15. Using the method of Section 5.iii (b), formula (13 A), examine
the significance of the difference between the following pairs of means:

(a) Test A: mean of subjects 1-25 and 26-50.

(b) Test C: mean of subjects 1-25 and 26-50.

(c) Test D: mean of subjects 1-25 and 26-50.

(d) Test G: mean of subjects 1-25 and 26-50.

(e) Mean of subjects 1-25 in test C and mean of subjects 1-25 in
test G.

(f) Mean of subjects 51-75 in test C and mean of subjects 51-75
in test G.

(g) Mean of subjects 51-75 in test A and mean of subjects 51-75
in test D.

(h) Mean of subjects 76-100 in test F and mean of subjects 76-100
in test G.



Chapter VI 

CORRELATION 

6.i. ft frequently happens in experimental work that we 
wish to know the association between two variables, that is, 
to know to what extent one variable is related to the other. 
There are various methods of measuring such association, de¬ 
pendent on the nature of the variables, their types of distribu¬ 
tion, etc. When the two variables are numerical and normally 
distributed , the association, or correlation , between them may 
best be measured by a method known as the product-moment 
method. This is by far the most useful and theoretically satis¬ 
factory method of measuring correlation and much advanced 
statistical work is based upon product-moment correlation. It 
will therefore be considered first. 

6.ii. In describing methods of correlation the two variables
will be called X and Y. These variables will have means
$\bar{X}$ and $\bar{Y}$ and standard deviations $\sigma_x$ and
$\sigma_y$. Since the various values of the variables will always
be considered in pairs, an X with a Y, there will be the same
number of X's and Y's in any particular case, i.e. N. In terms
of these statistics a quantity known as the coefficient of
correlation may be calculated: this coefficient is denoted by r.
If there is complete positive correlation between X and Y, r has
the value 1; if there is complete negative correlation it has the
value -1, and incomplete correlation gives decimal values for r
between 1 and -1. If there is no relation at all between the
variables, r is 0.

The meaning of the above statements may be illustrated in
this way. Suppose the heights and weights of N people were
measured: these would be our X and Y and there would be one
value of X and one value of Y relating to each person. Now
suppose the tallest person was also the heaviest, the second
tallest the second heaviest and so on, until we reached the
shortest person, who was also the lightest: in this case there
would be complete positive association between X and Y
and r would be 1 (provided the relation between X and Y
was exactly linear and could be expressed by the equation
Y = a + bX). If, on the other hand, the group of persons was
so peculiar that the tallest person was also the lightest, the
second tallest the second lightest and so on to the shortest, who
would be the heaviest in this case, then there would be complete
negative correlation between height and weight, and r
would be -1 (with the same proviso as before). Complete
correlation is very rare. Usually there is a general but not
complete agreement between two variables, so that r is
fractional.


6.iii. The general formula for r is quite simple. If
$(X - \bar{X})$ and $(Y - \bar{Y})$ are the deviations of
corresponding values of X and Y from their means (i.e. the pair
of values corresponding to one person), then these two deviations
may be multiplied together to give the product
$(X - \bar{X})(Y - \bar{Y})$. If we add together all such
products for the N persons, we obtain
$S(X - \bar{X})(Y - \bar{Y})$. The coefficient of correlation is
then given by the formula

$$r = \frac{S(X - \bar{X})(Y - \bar{Y})}{N \sigma_x \sigma_y}. \qquad (14)$$

The application of the name 'product-moment' to this method
may now be appreciated. The average deviation of any value
of X from the mean may be described as the first moment about
the mean, and similarly for the Y's. The mean product of such
deviations is similarly called a 'product-moment'.
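The definition in formula (14) may be sketched in a few lines of Python for readers who have a computer to hand. This is only an illustrative sketch: the function name `product_moment_r` and the height and weight figures are assumptions made for the illustration and do not come from the text. The standard deviations use the divisor N, as in the formula.

```python
from math import sqrt

def product_moment_r(xs, ys):
    """Coefficient of correlation computed directly from formula (14)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard deviations with divisor N, as formula (14) requires.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Sum of the products of paired deviations from the two means.
    product_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return product_sum / (n * sd_x * sd_y)

# Hypothetical figures: weight is an exact linear function of height.
heights = [60, 62, 64, 66, 68]
weights = [2 * h + 10 for h in heights]

r_positive = product_moment_r(heights, weights)                 # complete positive
r_negative = product_moment_r(heights, [-w for w in weights])   # complete negative
```

With an exactly linear relation the two results are 1 and -1 (apart from rounding in the machine), which agrees with the proviso stated in Section 6.ii.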

Formula (14) is the simple theoretical form. In practice con¬ 
siderations of ease in calculation necessitate modifications of 
this and we shall now consider some of them. 


6.iv. Product-moment correlation when N is small. If
we have a small number of X's and Y's, say less than 80, the
calculation of r would be performed as follows. Write down the
values of X and Y in two parallel columns in their pairs, i.e. so
that the pair of readings in each horizontal row belongs to the
same person. Next calculate two more columns, headed X² and
Y², by squaring the terms in the first two columns. The totals of
these four columns will give us S(X), S(Y), S(X²) and S(Y²),
and these data will enable us to calculate the means and standard
deviations of X and Y by formulae (1) and (5B).
Now instead of subtracting the mean $\bar{X}$ from each value of
X and $\bar{Y}$ from each value of Y and multiplying the answers
together, we shall get the total of products, or product-sum, by
a method similar to that used in Section 3.vii for calculating
the standard deviation. We shall multiply together the actual
values of X and Y as they stand in columns 1 and 2, sum the
products, and then subtract the product of $\bar{X}$ and $\bar{Y}$
at the end, after having divided the product-sum by N. The
validity of this may readily be seen by the following simple
algebraical proof:

$$(X - \bar{X})(Y - \bar{Y}) = XY - \bar{X}Y - \bar{Y}X + \bar{X}\bar{Y}.$$

$$S(X - \bar{X})(Y - \bar{Y}) = S(XY) - \bar{X}S(Y) - \bar{Y}S(X) + N\bar{X}\bar{Y}.$$

$$\frac{S(X - \bar{X})(Y - \bar{Y})}{N} = \frac{S(XY)}{N} - \bar{X}\frac{S(Y)}{N} - \bar{Y}\frac{S(X)}{N} + \bar{X}\bar{Y}$$

$$= \frac{S(XY)}{N} - \bar{X}\bar{Y} - \bar{Y}\bar{X} + \bar{X}\bar{Y}$$

$$= \frac{S(XY)}{N} - \bar{X}\bar{Y}.$$

Accordingly we form a fifth column, headed XY, by multiplying
together corresponding X's and Y's in the first two
columns. The total of this column is S(XY). We may then
obtain r from the formula

$$r = \frac{\dfrac{S(XY)}{N} - \bar{X}\bar{Y}}{\sigma_x \sigma_y}. \qquad (14A)$$


Example 15. Twenty pupils are given small tests in Arithmetic
and Latin, and the marks gained in each test, from a
maximum of 10, are shown below.

Pupil     A   B   C   D   E   F   G   H   I   J
Arith.    3   9   7   8   4   1   6   9   7   8
Latin     1   8   4  10   6   5   5   3   8   7

Pupil     K   L   M   N   O   P   Q   R   S   T
Arith.    5   4   6   5   2   6   5   4   6   2
Latin     2   6   5   9   4   5   7   1   3   5

Calculate the coefficient of correlation between the two sets of
marks.
We construct the five columns, as described above, and
obtain S(X), S(Y), S(X²), S(Y²) and S(XY). The actual
arithmetic is shown.

    X      Y     X²     Y²     XY
    3      1      9      1      3
    9      8     81     64     72
    7      4     49     16     28
    8     10     64    100     80
    4      6     16     36     24
    1      5      1     25      5
    6      5     36     25     30
    9      3     81      9     27
    7      8     49     64     56
    8      7     64     49     56
    5      2     25      4     10
    4      6     16     36     24
    6      5     36     25     30
    5      9     25     81     45
    2      4      4     16      8
    6      5     36     25     30
    5      7     25     49     35
    4      1     16      1      4
    6      3     36      9     18
    2      5      4     25     10
  107    104    673    660    595

$$\bar{X} = \frac{107}{20} = 5.35, \qquad \bar{Y} = \frac{104}{20} = 5.20.$$

By formula (5B)

$$\sigma_x = \sqrt{\frac{673}{20} - 5.35^2} = 2.242,$$

$$\sigma_y = \sqrt{\frac{660}{20} - 5.20^2} = 2.441,$$

$$\frac{S(XY)}{N} = \frac{595}{20} = 29.75.$$

Hence, by formula (14A),

$$r = \frac{29.75 - 5.35 \times 5.20}{2.242 \times 2.441} = 0.353.$$
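The arithmetic of Example 15 may be checked by machine. The following is a minimal sketch of formula (14A), assuming the marks as tabulated above; the function name `r_from_raw_sums` and the layout of its result are illustrative choices only.

```python
from math import sqrt

# Marks of the twenty pupils A to T in Example 15.
arith = [3, 9, 7, 8, 4, 1, 6, 9, 7, 8, 5, 4, 6, 5, 2, 6, 5, 4, 6, 2]
latin = [1, 8, 4, 10, 6, 5, 5, 3, 8, 7, 2, 6, 5, 9, 4, 5, 7, 1, 3, 5]

def r_from_raw_sums(xs, ys):
    """Return the five column totals and r as given by formula (14A)."""
    n = len(xs)
    s_x, s_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs)
    s_yy = sum(y * y for y in ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    mean_x, mean_y = s_x / n, s_y / n
    # Formula (5B): standard deviation from the sum of squares of raw values.
    sd_x = sqrt(s_xx / n - mean_x ** 2)
    sd_y = sqrt(s_yy / n - mean_y ** 2)
    r = (s_xy / n - mean_x * mean_y) / (sd_x * sd_y)
    return {"S(X)": s_x, "S(Y)": s_y, "S(X2)": s_xx,
            "S(Y2)": s_yy, "S(XY)": s_xy, "r": r}

result = r_from_raw_sums(arith, latin)
```

The totals agree with the foot of the table (107, 104, 673, 660 and 595), and the value of r rounds to 0.353.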



6.v. If the actual values of X and Y are large, a good deal of
arithmetic would be needed to obtain the third, fourth and
fifth columns in the above method. Provided the ranges of the
X's and Y's are fairly small, a modification of the method may
be made by writing down the values of X and Y as they deviate
from suitable arbitrary origins. Such arbitrary origins would
be chosen near the middle of the range of each variable and the
deviations of each X and Y from these would have to be
written with due regard to sign. For example, if the values of
X varied from 61 to 78 we might take 70 as arbitrary origin.
In this case a reading of 61 would be written as -9, i.e. 61 - 70.
In the same way, 68 would become -2, 78 would be 8 and
70 would be 0.

In this manner we could replace our original X and Y
columns by two other columns recording the deviations of the
X's and Y's from their arbitrary origins. From these the
columns of squares and products may be obtained as before.
In employing this method it is advantageous to set out the plus
and minus values of X and Y in separate columns, as shown in
Example 16.

In this case S(X)/N would give the difference between the
arbitrary origin and the true mean of X, so that using the
notation of Section 2.iii we may write $S(X)/N = D_x$ and
$S(Y)/N = D_y$. Similarly, from Section 3.viii, with
modifications, we get

$$\sigma_x = \sqrt{\frac{S(X^2)}{N} - D_x^2}, \qquad \sigma_y = \sqrt{\frac{S(Y^2)}{N} - D_y^2}.$$

Formula (14A) is accordingly modified to the following in this
case:

$$r = \frac{\dfrac{S(XY)}{N} - D_x D_y}{\sigma_x \sigma_y}. \qquad (14B)$$

The calculation of r by this method is exemplified below.

Example 16. Find the correlation between the two test
scores given below for 10 subjects.

Subject     1    2    3    4    5    6    7    8    9   10
Test A     19   25   17   20   26   30   29   21   23   24
Test B    145  151  140  144  138  140  142  150  149  150

Test A has a range from 17 to 30, so that a convenient
arbitrary origin will be 25. For test B a convenient origin will
be 145, as the range is from 138 to 151.



       X             Y                           XY
   +      -      +      -      X²     Y²      +      -
         -6      0             36      0      0
   0             6              0     36      0
         -8            -5      64     25     40
         -5            -1      25      1      5
   1                   -7       1     49            -7
   5                   -5      25     25           -25
   4                   -3      16      9           -12
         -4      5             16     25           -20
         -2      4              4     16            -8
         -1      5              1     25            -5
  10    -26     20    -21     188    211     45    -77
     -16            -1                          -32

Hence

$$D_x = -16/10 = -1.6; \qquad D_y = -1/10 = -0.1,$$

$$\sigma_x = \sqrt{188/10 - 1.6^2} = 4.030,$$

$$\sigma_y = \sqrt{211/10 - 0.1^2} = 4.592.$$

$$r = \frac{-32/10 - (-1.6 \times -0.1)}{4.030 \times 4.592} = -0.182.$$
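The working of Example 16 may be verified with a short sketch of formula (14B). The function name `r_from_arbitrary_origins` is an assumption made for the illustration; the scores and the origins 25 and 145 are those of the example. Taking both origins as 0 reduces the method to formula (14A), so the two results should agree.

```python
from math import sqrt

test_a = [19, 25, 17, 20, 26, 30, 29, 21, 23, 24]
test_b = [145, 151, 140, 144, 138, 140, 142, 150, 149, 150]

def r_from_arbitrary_origins(xs, ys, origin_x, origin_y):
    """Formula (14B): r from deviations measured from arbitrary origins."""
    dev_x = [x - origin_x for x in xs]
    dev_y = [y - origin_y for y in ys]
    n = len(dev_x)
    d_x = sum(dev_x) / n          # D_x, origin-to-mean difference for X
    d_y = sum(dev_y) / n          # D_y, origin-to-mean difference for Y
    sd_x = sqrt(sum(v * v for v in dev_x) / n - d_x ** 2)
    sd_y = sqrt(sum(v * v for v in dev_y) / n - d_y ** 2)
    s_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    return (s_xy / n - d_x * d_y) / (sd_x * sd_y)

r_with_origins = r_from_arbitrary_origins(test_a, test_b, 25, 145)
r_from_raw = r_from_arbitrary_origins(test_a, test_b, 0, 0)
```

Both calls give the same coefficient, about -0.182, which illustrates that the choice of arbitrary origin affects the labour of the arithmetic but not the value of r.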


6.vi. Product-moment correlation when N is large.
If N is large, over 80, the foregoing methods of calculating r
become very laborious and it is usual to curtail the arithmetic
involved by the use of a tabular method. For this purpose we
construct what is known as a correlation table. The construction
of such a table is most easily understood by reference to
an example such as is given in Table IV, which illustrates the
correlation between two tests, X and Y.

In this table each of the test scores has been grouped for
working purposes (see Section 2.iii). For test X the convenient
group unit was 2 and accordingly the X groupings are written
along the top of the table. Test Y had a larger range and the
group unit chosen was 10: the Y groupings are written at the
left-hand side of the table.

We next proceed to make a spot diagram showing the distribution
of the pairs of scores. For example, one subject scored
0 in test X and 55 in test Y; accordingly a dot is put in the
square, or cell, with the X grouping 0, 1 above it and the Y
grouping 50-59 on the left of it. Similarly a dot is made in the
appropriate cell for each other pair of scores. Hence each dot
represents a score both in test X and in test Y and the total
number of dots will be N, the number of subjects who took the
tests. The number of dots in each cell is counted and the
number written in the cell. (If the data are on cards, the cards
may be sorted into their appropriate groups and counted, thus
obviating the necessity of making a spot diagram first.)

The total number of observations in each horizontal row, or
array, is then found and recorded on the right of the table.
These will form a column headed $f_y$, which will give the grouped
frequency distribution of Y. Similarly, at the foot of each
column of the table the total number of observations is recorded,
giving a horizontal row for $f_x$, or the grouped frequency
distribution of X. The totals of this last column and row, i.e.
$S(f_y)$ and $S(f_x)$, should be the same and equal to N.

We then choose an arbitrary origin for each test, call the
corresponding group 0, and then number the groups on both
sides as we did in Section 2.iii. This will give us a column on the
right headed 'y' and a row at the top of the table which will
be 'x'. The means and standard deviations of both X and Y
in working units may now be calculated, using the method of
Section 3.viii. This is most conveniently done at the side of the
table (see Example 17). There is no need to obtain the true
means and standard deviations of the tests; indeed it is important
to keep all the calculations in working units throughout.
Hence we calculate $D_x$ and $D_y$, using the notation of
Section 2.iii, and $\sigma_{x(w)}$ and $\sigma_{y(w)}$, using the
notation of formula (7A).

There now remains the problem of finding the sum of the
xy products. Reference to Table IV will show how this is
done.

[Table IV. Correlation table for tests X and Y, N = 100. X is
grouped in units of 2 (groupings 0, 1 to 22, 23; working units
x = -6 to 5) along the top; Y is grouped in units of 10
(groupings 50-59 upwards) down the left-hand side; the columns
$f_y$, y, $T_{xy}$ and $yT_{xy}$ stand on the right.]

Starting with the top horizontal array, write in each cell the
product of the frequency in that cell and the x-grouping above
it. For instance, the frequency in the first cell is 1 (written in
the bottom right-hand corner) and the x-grouping for that cell
is -6; hence we write -6 in the top right-hand corner of the
cell. Continue this for each cell containing observations. Then
add these small top right-hand corner numbers for each horizontal
array and record the totals in a column on the right of
the table. Head this column $T_{xy}$, indicating the total of the
x's for each y. As a check on the arithmetic so far, sum this
column, and the total of it, $\Sigma(T_{xy})$, should be equal to
$\Sigma(f_x x)$, as was found in calculating the mean of x.

Finally, we make one further column on the right of the
table by multiplying together the entries in the y and $T_{xy}$
columns: this new column will be headed $yT_{xy}$. Sum this
column to obtain $\Sigma(yT_{xy})$. This gives us the sum of the
xy products. Care should be taken with the signs in all these
calculations.

We have now all the necessary data and the correlation
coefficient may be found by substitution in the modified
formula:

$$r = \frac{\dfrac{\Sigma(yT_{xy})}{N} - D_x D_y}{\sigma_{x(w)} \, \sigma_{y(w)}}. \qquad (14C)$$

The whole of the arithmetic involved is shown in the following
example.

Example 17. Calculate the coefficient of correlation between
the tests X and Y in Table IV. (See p. 48.)
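The tabular method of formula (14C) may be sketched as follows. The small 3 by 3 table used here is hypothetical and is not Table IV; it is chosen only so that the steps can be followed by eye. The names `grouped_r`, `x_units` and `y_units` are likewise assumptions made for the illustration. Rows are the y arrays and columns the x groupings, both in working units.

```python
from math import sqrt

# Hypothetical correlation table in working units (not Table IV).
x_units = [-1, 0, 1]        # x groupings along the top
y_units = [-1, 0, 1]        # y groupings down the side
table = [
    [3, 1, 0],              # cell frequencies for y = -1
    [1, 4, 1],              # cell frequencies for y = 0
    [0, 2, 3],              # cell frequencies for y = +1
]

def grouped_r(table, x_units, y_units):
    """Correlation from cell frequencies, following formula (14C)."""
    f_y = [sum(row) for row in table]            # row totals
    f_x = [sum(col) for col in zip(*table)]      # column totals
    n = sum(f_y)
    s_fx_x = sum(f * x for f, x in zip(f_x, x_units))
    s_fy_y = sum(f * y for f, y in zip(f_y, y_units))
    d_x, d_y = s_fx_x / n, s_fy_y / n
    sd_x = sqrt(sum(f * x * x for f, x in zip(f_x, x_units)) / n - d_x ** 2)
    sd_y = sqrt(sum(f * y * y for f, y in zip(f_y, y_units)) / n - d_y ** 2)
    # T_xy: for each array, the total of (frequency times x-grouping).
    t_xy = [sum(f * x for f, x in zip(row, x_units)) for row in table]
    # The check described in the text: the T_xy column must total S(f_x x).
    if sum(t_xy) != s_fx_x:
        raise ArithmeticError("total of T_xy does not equal the sum of f_x x")
    s_y_t = sum(y * t for y, t in zip(y_units, t_xy))
    return (s_y_t / n - d_x * d_y) / (sd_x * sd_y)

r_grouped = grouped_r(table, x_units, y_units)
```

For this hypothetical table the $T_{xy}$ column is -3, 0, 3, the sum of the xy products is 6, and r comes to about 0.71.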

6.vii. The diagonal summation method. There is an
alternative method, which is preferred by some computers, of
calculating r from a correlation table of grouped data. The
correlation table is constructed as in Example 17 but the small
product figures in the top right-hand corners of the cells are
omitted. As in that example the columns for $f_y$, $y$, $f_y y$
and $f_y y^2$ are written, and also those for $f_x$, $x$, $f_x x$
and $f_x x^2$. From these are calculated two quantities which we
shall denote by A and B. These are obtained as follows:

$$A = \Sigma(f_x x^2) - \frac{[\Sigma(f_x x)]^2}{N},$$

$$B = \Sigma(f_y y^2) - \frac{[\Sigma(f_y y)]^2}{N}.$$

We now construct a further column by summing the total
frequencies in each diagonal of the table. In order to obtain the
correct sign for r, it is important that these diagonals shall run
from the corner of the table where both variables have low
scores to the corner where both have high scores; in Example
17, for instance, the diagonals will all be parallel to that running
from the top left-hand corner to the bottom right-hand corner.
This column we head $f_d$.

We then proceed as though we were finding the standard
deviation of this column, i.e. an arbitrary origin is chosen and
numbered 0 and the frequencies numbered positively and
negatively on the two sides of this origin, yielding a column
headed d. Two further columns, headed $f_d d$ and $f_d d^2$, are
obtained in the usual way and summed. From these totals we
then obtain a third quantity which we shall denote by C. This
is given by

$$C = \Sigma(f_d d^2) - \frac{[\Sigma(f_d d)]^2}{N}.$$

The coefficient of correlation may then be obtained from the
formula

$$r = \frac{A + B - C}{2\sqrt{AB}}. \qquad (14D)$$

(It is left to the student to relate this formula to formula (14).
He should have no difficulty in doing this if he bears in mind
that if x and y represent deviations of the two variables from
their means, then

$$2S(xy) = S(x^2) + S(y^2) - S(x - y)^2$$

and that (x - y) is constant for any diagonal.)
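The diagonal summation method may be sketched in the same manner. Again the 3 by 3 table is hypothetical (it is not Table IV), and the names `corrected_sum_of_squares` and `diagonal_r` are assumptions made for the illustration. With the rows arranged from low y to high y and the columns from low x to high x, every cell on one diagonal has the same value of x - y, which is here used directly as d.

```python
from math import sqrt

# Hypothetical correlation table in working units (not Table IV).
x_units = [-1, 0, 1]
y_units = [-1, 0, 1]
table = [
    [3, 1, 0],              # y = -1
    [1, 4, 1],              # y = 0
    [0, 2, 3],              # y = +1
]

def corrected_sum_of_squares(freqs, units):
    """Sum of f.u^2 less (sum of f.u)^2 / N, the form shared by A, B and C."""
    n = sum(freqs)
    s1 = sum(f * u for f, u in zip(freqs, units))
    s2 = sum(f * u * u for f, u in zip(freqs, units))
    return s2 - s1 * s1 / n

def diagonal_r(table, x_units, y_units):
    """Correlation by diagonal summation, formula (14D)."""
    f_y = [sum(row) for row in table]
    f_x = [sum(col) for col in zip(*table)]
    a = corrected_sum_of_squares(f_x, x_units)
    b = corrected_sum_of_squares(f_y, y_units)
    # Total frequency on each diagonal, indexed by d = x - y.
    f_d = {}
    for y, row in zip(y_units, table):
        for x, f in zip(x_units, row):
            f_d[x - y] = f_d.get(x - y, 0) + f
    d_values = sorted(f_d)
    c = corrected_sum_of_squares([f_d[d] for d in d_values], d_values)
    return (a + b - c) / (2 * sqrt(a * b))

r_diagonal = diagonal_r(table, x_units, y_units)
```

For this table the diagonal frequencies are 3, 10 and 2 for d = -1, 0 and 1, and the coefficient is about 0.71, the same value as the sum of xy products would give for the same table.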

The whole of the arithmetic involved is shown in the
following example.



Example 18. Calculate the coefficient of correlation of the
data in Fig. 4 using the method of diagonal summation.

Since in this case the $f_d$ column will have to be written on the
right of the table, it is convenient to write the $f_y$ and associated
columns on the left and the $f_x$ and associated columns above
the table. We have therefore the following setting out of the
working:

f_x x²    36    -   48   63   32   21    0   15   36   27   32   25    335
f_x x     -6    -  -12  -21  -16  -21    0   15   18    9    8    5    55 - 76 = -21
x         -6   -5   -4   -3   -2   -1    0    1    2    3    4    5

[...]/100
= 278 - 0.04
= 277.96


6.viii. Significance of product-moment correlation:
standard error of r. The standard error of r is usually taken
as

$$\text{S.e. of } r = \frac{1 - r^2}{\sqrt{N}}. \qquad (15)$$

As usual, the probable error is 0.67449 times the standard
error.

This formula is approximately true when N is large and the
values of r are small or moderate in size. In such cases the
correlation may be taken as differing significantly from zero if
r is more than twice its standard error, or more than three times
its probable error.

In small samples, however, or with very large values of r, the
above formula is not true and the significance of r should be
assessed by the t method. For correlation coefficients

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}. \qquad (16)$$


It is unnecessary to calculate t for each value of r that is found. 
Appendix B gives graphically the criterion of significance for 
r for values of N from 50 to 270. The significance of r when N 
is 50 or less may be found from the following table, extracted 
from R. A. Fisher (Ref. 2), Table V A, or Fisher and Yates 
(Ref. 5), Table VI. 


Table V. Values of r for P = 0.05
n = N - 2

   n      r          n      r
   1    0.997       14    0.497
   2    0.950       15    0.482
   3    0.878       16    0.468
   4    0.811       17    0.456
   5    0.754       18    0.444
   6    0.707       19    0.433
   7    0.666       20    0.423
   8    0.632       25    0.381
   9    0.602       30    0.349
  10    0.576       35    0.325
  11    0.553       40    0.304
  12    0.532       45    0.288
  13    0.514       50    0.273


In using the above table n is 2 less than N, the number of
pairs of observations in the correlation. If the calculated value
of r is as big as or bigger than the value given in the table for
the appropriate value of n, the correlation differs significantly
from zero, i.e. it indicates a real degree of association between
the two variables.

The use of these various assessments of significance is
illustrated in the following example.

Example 19. The scores in two tests on 100 subjects are
correlated and the value of r obtained is 0.35. Is the correlation
significant?

(a) By formula (15):

$$\text{S.e. of } r = \frac{1 - 0.1225}{10} = 0.08775.$$

Hence r is significant, since 0.35 is just about 4 times its
standard error.

(b) By formula (16):

$$t = \frac{0.35\sqrt{98}}{\sqrt{1 - 0.1225}} = 3.70.$$

This value of t is much larger than that given in Table III,
p. 32, for n = 30, so that it must be even larger than that for
n = 98; hence r is significantly larger than zero.

(c) By consulting the graph in Appendix B it will be seen
that for N = 100 a value of r of 0.193 is significant. Hence our
present value of 0.35 is definitely significant.

6.ix. Significance of the difference between two
correlations. We sometimes wish to know whether the correlation
between two variables is different in two different samples. To
do this we make use of a method devised by Fisher (Ref. 2)
which entails transforming r into a quantity which he calls z.
This is given by the formula

$$z = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\}.$$

Once again there is no need to make the actual calculation as
Table VI gives values of z corresponding to values of r up to
0.500, and vice versa. For values outside the range of this
table the student is referred to Table V B in Fisher (Ref. 2), or
Table VII in Fisher and Yates (Ref. 5).

Table VI. Conversion of r into z and z into r

         For z                      For r
       r           add            z          subtract
  0.000-0.114     0.000      0.000-0.114      0.000
  0.115-0.163     0.001      0.115-0.165      0.001
  0.164-0.194     0.002      0.166-0.196      0.002
  0.195-0.216     0.003      0.197-0.220      0.003
  0.217-0.235     0.004      0.221-0.240      0.004
  0.236-0.251     0.005      0.241-0.256      0.005
  0.252-0.265     0.006      0.257-0.271      0.006
  0.266-0.277     0.007      0.272-0.285      0.007
  0.278-0.288     0.008      0.286-0.297      0.008
  0.289-0.299     0.009      0.298-0.309      0.009
  0.300-0.309     0.010      0.310-0.320      0.010
  0.310-0.318     0.011      0.321-0.330      0.011
  0.319-0.327     0.012      0.331-0.339      0.012
  0.328-0.335     0.013      0.340-0.348      0.013
  0.336-0.343     0.014      0.349-0.357      0.014
  0.344-0.350     0.015      0.358-0.365      0.015
  0.351-0.357     0.016      0.366-0.373      0.016
  0.358-0.364     0.017      0.374-0.381      0.017
  0.365-0.371     0.018      0.382-0.389      0.018
  0.372-0.377     0.019      0.390-0.396      0.019
  0.378-0.383     0.020      0.397-0.403      0.020
  0.384-0.388     0.021      0.404-0.409      0.021
  0.389-0.393     0.022      0.410-0.416      0.022
  0.394-0.399     0.023      0.417-0.422      0.023
  0.400-0.404     0.024      0.423-0.428      0.024
  0.405-0.409     0.025      0.429-0.434      0.025
  0.410-0.414     0.026      0.435-0.440      0.026
  0.415-0.419     0.027      0.441-0.446      0.027
  0.420-0.423     0.028      0.447-0.452      0.028
  0.424-0.428     0.029      0.453-0.457      0.029
  0.429-0.432     0.030      0.458-0.463      0.030
  0.433-0.436     0.031      0.464-0.468      0.031
  0.437-0.441     0.032      0.469-0.473      0.032
  0.442-0.445     0.033      0.474-0.478      0.033
  0.446-0.449     0.034      0.479-0.483      0.034
  0.450-0.453     0.035      0.484-0.488      0.035
  0.454-0.456     0.036      0.489-0.493      0.036
  0.457-0.460     0.037      0.494-0.498      0.037
  0.461-0.464     0.038      0.499-0.502      0.038
  0.465-0.467     0.039      0.503-0.507      0.039
  0.468-0.471     0.040      0.508-0.512      0.040
  0.472-0.474     0.041      0.513-0.516      0.041
  0.475-0.478     0.042      0.517-0.520      0.042
  0.479-0.481     0.043      0.521-0.525      0.043
  0.482-0.484     0.044      0.526-0.529      0.044
  0.485-0.488     0.045      0.530-0.533      0.045
  0.489-0.491     0.046      0.534-0.537      0.046
  0.492-0.494     0.047      0.538-0.542      0.047
  0.495-0.497     0.048      0.543-0.546      0.048
  0.498-0.500     0.049      0.547-0.550      0.049

To use this table, look up the value of r in the left-hand column and
add to it the corresponding value in the second column, as z is always
bigger than r. To turn z into r, look up z in the third column and
subtract the corresponding entry in the last column.

For a value of N greater than 3, the standard error of z is
$1/\sqrt{N - 3}$, and the standard error of the difference between
two z's is

$$\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}},$$

where $N_1$ and $N_2$ are the numbers of pairs in the two samples.
As usual, if the difference between the two z's is greater than
twice its standard error, then the difference is significant.

Example 20. Two groups of children, one of average age 11
and the other of average age 14, are given an intelligence test
and an arithmetic test, and the scores are correlated for each
group separately. The numbers in the groups and the correlation
coefficients were as follows:

11 year olds, $N_1 = 43$, $r_1 = 0.48$,

14 year olds, $N_2 = 39$, $r_2 = 0.39$.

Can the correlation between intelligence and arithmetic be
regarded as different in the two groups?

From Table VI,

$z_1 = 0.523$ and $z_2 = 0.412$.

The difference is 0.111.

$$\text{S.e. of difference} = \sqrt{\frac{1}{40} + \frac{1}{36}} = 0.230.$$

Hence the difference between the z's is just less than half its
standard error and so cannot be regarded as significant.
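The comparison made in Example 20 may be sketched as follows. The function name `compare_correlations` is an assumption made for the illustration. The transformation of r into z is carried out with `math.atanh`, which gives the same quantity as the formula for z in Section 6.ix, so that Table VI is not needed.

```python
from math import atanh, sqrt

def compare_correlations(r1, n1, r2, n2):
    """Difference between two z's and the standard error of that difference."""
    z1, z2 = atanh(r1), atanh(r2)      # z = (1/2){log(1 + r) - log(1 - r)}
    difference = z1 - z2
    se_difference = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    # The criterion of the text: more than twice the standard error.
    significant = abs(difference) > 2 * se_difference
    return z1, z2, difference, se_difference, significant

z1, z2, difference, se_difference, significant = compare_correlations(0.48, 43, 0.39, 39)
```

To three places the results are z1 = 0.523, z2 = 0.412 and a standard error of 0.230, so that the difference is not significant, in agreement with the example.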

6.x. Mean of several values of r. Use may also be made
of the z transformation to obtain the average of several values
of r. Coefficients of correlation should not be regarded as
ordinary numbers which may be added and divided, and due
weight should be given to the number of pairs in each correlation,
values calculated from larger samples being more important
than values from smaller samples.

To average several values of r, first transform each r into z
and then multiply each z by N - 3, where N is the number of
pairs in the original r. Sum these products and divide the total
by the sum of the (N - 3)'s, giving the mean value of z. Finally
transform this mean z back into r, and this will be the correct
value for the mean of the original correlation coefficients. The
calculation may be conveniently tabulated as in the following
example.


Example 21. The same two variables are correlated in three
different groups. The numbers in the groups and the values of
r are given below. What is the average correlation in the three
groups?

            N      r
Group 1    23    0.41
Group 2    28    0.35
Group 3    35    0.50

Tabulate the work as follows:

    r        z      N - 3    (N - 3)z
  0.41     0.436     20       8.720
  0.35     0.365     25       9.125
  0.50     0.549     32      17.568
                     77      35.413

$$\text{Mean } z = \frac{35.413}{77} = 0.460.$$

For z = 0.460, r = 0.430 (from Table VI). Hence the average
correlation in the three groups is 0.43.

The significance of an average r may be tested as though it
had been calculated from S(N - 3) + 3 pairs of observations.
In the above example the average r may be tested for
significance as though it had been a single r calculated from 80 pairs.
It is therefore definitely significant, although of its component
r's only that in group 3 is significant by itself.
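The averaging of Example 21 may be sketched as follows. The function name `mean_correlation` is an assumption made for the illustration; `math.atanh` and `math.tanh` take the place of the two halves of Table VI, so the intermediate figures may differ from the tabulated ones in the last decimal place.

```python
from math import atanh, tanh

def mean_correlation(groups):
    """Average several r's through z, weighting each z by N - 3.

    `groups` is a list of (N, r) pairs. Returns the mean z, the mean r,
    and the number of pairs from which the mean r may be treated as calculated.
    """
    weights = [n - 3 for n, _ in groups]
    weighted_z = [w * atanh(r) for w, (_, r) in zip(weights, groups)]
    mean_z = sum(weighted_z) / sum(weights)
    return mean_z, tanh(mean_z), sum(weights) + 3

groups = [(23, 0.41), (28, 0.35), (35, 0.50)]
mean_z, mean_r, effective_pairs = mean_correlation(groups)
```

The results are a mean z of about 0.46 and a mean r of about 0.43, to be tested as though calculated from 80 pairs, as in the example.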
6.xi. Partial correlation. In some cases it seems probable
that two variables X and Y are correlated partly on account
of the fact that each of them is correlated with a third
variable, Z. For instance, it may be that there is a correlation
between scores in an arithmetic examination and scores in a
Latin examination, partly because ability to do both arithmetic
and Latin is correlated with intelligence. In such a case
we may wish to find the correlation between X and Y quite
apart from the influence of Z. This may readily be done by the
method of partial correlation. All we need to know is the three
correlations, viz. $r_{XY}$ between X and Y, $r_{XZ}$ between X and Z
and $r_{YZ}$ between Y and Z. The correlation between X and Y
with the influence of Z removed is then given by the formula

$$r_{XY.Z} = \frac{r_{XY} - r_{XZ} \cdot r_{YZ}}{\sqrt{1 - r_{XZ}^2}\,\sqrt{1 - r_{YZ}^2}}. \qquad (17)$$

The symbol $r_{XY.Z}$ is read 'the correlation between X and Y
keeping Z constant'.

A table giving the values of $(1 - r^2)$ for all values of r to 3
places of decimals may be found in Tables for Statisticians and
Biometricians, Table VIII, p. 20 (Ref. 3).


Example 22. Three tests, A, B, and C, were given to a group
of students and the three sets of scores were correlated with
each other, giving the following coefficients:

$r_{AB} = 0.66$; $r_{AC} = 0.60$; $r_{BC} = 0.40$.

What is the correlation between A and B keeping C constant?
From formula (17),

$$r_{AB.C} = \frac{0.66 - 0.60 \times 0.40}{\sqrt{1 - 0.36}\,\sqrt{1 - 0.16}} = \frac{0.42}{0.7328} = 0.57.$$
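Example 22 may be checked with a short sketch of formula (17). The function name `partial_r` and the order of its arguments are assumptions made for the illustration.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """Formula (17): correlation between X and Y keeping Z constant."""
    return (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz ** 2) * sqrt(1 - r_yz ** 2))

# Example 22: r_AB = 0.66, r_AC = 0.60, r_BC = 0.40.
r_ab_c = partial_r(0.66, 0.60, 0.40)
```

The result rounds to 0.57, as in the example.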


The above formula and method may be extended to four or
more variables. The formula for four variables is given below:
the student is unlikely to need further extensions.

$$r_{AB.CD} = \frac{r_{AB.C} - r_{AD.C} \cdot r_{BD.C}}{\sqrt{1 - r_{AD.C}^2}\,\sqrt{1 - r_{BD.C}^2}}. \qquad (17A)$$

It will be seen that this partial correlation, the correlation
between A and B keeping both C and D constant, requires the
calculation of three other partial correlations, each keeping
only one variable constant.
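The extension to four variables may be sketched by applying formula (17) twice. In the sketch below the coefficients between A, B and C are those of Example 22, while the three coefficients involving D are hypothetical values invented for the illustration; the function names are likewise assumptions.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """Formula (17): correlation between X and Y keeping Z constant."""
    return (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz ** 2) * sqrt(1 - r_yz ** 2))

def partial_r_two_constant(r_ab, r_ac, r_bc, r_ad, r_bd, r_cd):
    """Formula (17A): correlation between A and B keeping C and D constant."""
    r_ab_c = partial_r(r_ab, r_ac, r_bc)   # A and B, keeping C constant
    r_ad_c = partial_r(r_ad, r_ac, r_cd)   # A and D, keeping C constant
    r_bd_c = partial_r(r_bd, r_bc, r_cd)   # B and D, keeping C constant
    return partial_r(r_ab_c, r_ad_c, r_bd_c)

# r_AB, r_AC, r_BC from Example 22; r_AD, r_BD, r_CD are hypothetical.
r_ab_cd = partial_r_two_constant(0.66, 0.60, 0.40, 0.50, 0.30, 0.20)

# Removing D first and then C (the roles of C and D exchanged) gives the same value.
r_ab_dc = partial_r_two_constant(0.66, 0.50, 0.30, 0.60, 0.40, 0.20)
```

The agreement of the two orders of working affords a useful check on the arithmetic when such a calculation is made by hand.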

EXERCISES ON CHAPTER VI 

16. Calculate the coefficient of correlation by the product-moment 
method between: 

(a) subjects 1-25 in tests C and D,

(b) subjects 26-50 in tests C and D,

(c) subjects 51-75 in tests C and D,

(d) subjects 76-100 in tests C and D.

Use formula (14A).

17. Check the results of Exercise 16 by using the method of
Section 6.v and formula (14B). In each case take 25 as arbitrary
origin for test C and 30 as arbitrary origin for test D.

18. Calculate the coefficient of correlation between each test from
A to G inclusive and each other one for the whole 100 subjects.
Employ the method of the correlation table, Section 6.vi, using formula
(14C) and the grouping adopted in Exercise 3. Repeat using the
method of diagonal summation and formula (14D).

19. Using formula (15), calculate the standard error of the values
of r obtained in Exercise 18 for the correlations between test F and
the other tests from A to G. Hence determine the significance of these
values of r and check by reference to (a) Table V and (b) the graph in
Appendix B.

20. Using the correlation coefficients given in the answer to
Exercise 18, determine whether $r_{BE}$ is significantly different from
$r_{CE}$, $r_{DE}$, $r_{EF}$, $r_{FG}$, $r_{CF}$ and $r_{DF}$. Use the method of
Section 6.ix and Table VI.

21. From the correlation coefficients given in the answer to Exercise 
18, calculate the following partial correlation coefficients: 

(a) between D and F keeping C constant;

(b) between C and F keeping D constant;

(c) between F and G keeping E constant;

(d) between E and F keeping G constant;

(e) between C and E keeping B constant.
7.i. The ranking method. It sometimes happens that the
actual values of two variables cannot be accurately measured,
although we are able to rank them in order of size or merit. In
such a case the method of product-moment correlation cannot
be applied, but an approximate coefficient of correlation may
be calculated. If N pairs of variables are ranked, X and Y
being ranked separately of course, and d represents the difference
between the ranks of X and Y for any one pair, then
the coefficient of ranked correlation is given by the formula

$$\rho = 1 - \frac{6\,\Sigma(d^2)}{N(N^2 - 1)}. \qquad (18)$$

This formula may be used whether the distributions of X and
Y are normal or not, but it only gives an approximate indication
of the association between the two variables and should
never be used if it is possible to calculate r. The coefficient
should not be employed in partial correlation, multiple
correlation, factorial analysis or any other statistical process which
is based on product-moment correlation.

The method of ranked correlation is frequently employed
because the calculation involved is simple for small samples.
The method of calculation is given here for use in cases
where the data are inadequate for product-moment
correlation.

The first step is to rank each variable, calling the best or
biggest value 1, the second best or biggest 2, and so on. When
two or more values of a variable are the same it is usual to give
each the average rank. For instance, if there are two equal
values for the 6th place, each is ranked as 6½, the next rank
being 8, since these two will have occupied the 6th and 7th
places between them. Similarly, if there are three equal values
for the 10th place, each will be ranked as 11, since they will
occupy the 10th, 11th and 12th places between them, and the
next value will be ranked as 13. By this means, if there are N
pairs of variables, each variable will be ranked from 1 to N.
(This averaging of ranks introduces a further inaccuracy into
formula (18), for the formula assumes that each ranking is
different: it is, however, frequently impossible to avoid it in using
this method.)

The pairs of ranks are written down in two columns and a
third column, headed d, formed by subtracting entries in the
second column from corresponding entries in the first. If the
correct signs are put down in this column, the total of the
column, or $\Sigma(d)$, should be zero. Each entry in the third
column is then squared, yielding a fourth column headed $d^2$.
Finally, this column is summed to obtain $\Sigma(d^2)$, which is then
substituted in formula (18).

In Appendix C are given values of the reciprocal of
$6/N(N^2 - 1)$ for values of N from 10 to 60. The use of this, as
explained in the Appendix, will save the student a good deal of
arithmetic.

The coefficient of ranked correlation has no standard error 
such as that usually calculated for r. A method of assess¬ 
ing the significance of p has, however, been worked out by 
M. G. Kendall and others: the method is too advanced for 
inclusion in this book, but the more mathematical student 
is referred to the following two papers: 

(1) ‘The distribution of Spearman’s coefficient of rank
correlation in a universe in which all rankings occur an equal
number of times.’ M. G. Kendall, Sheila F. H. Kendall and
B. Babington Smith. Biometrika, Vol. xxx, p. 251, Jan. 1939.

(2) ‘The problem of m rankings.’ M. G. Kendall and
B. Babington Smith. Ann. Math. Stats. Vol. x, No. 3, Sept. 1939.

Example 23. A class of 15 schoolboys was given an intelligence
test and the master provided their order of merit in an
entrance examination. What is the correlation between the
test and the entrance examination? Below are the marks and
order of merit of each boy.


Boy                        A   B   C   D   E   F   G   H
Order of merit             1   2   3   4   5   6   7   8
Intelligence test score   22  19   6  18  20  16  11   9

Boy                        I   J   K   L   M   N   O
Order of merit             9  10  11  12  13  14  15
Intelligence test score   15  12  10   7  13  12   8


Here one of the variables is ranked for us and may be written 
straight down. As regards the marks for the test, boy A scores 
most and is given the rank 1; E is next and is ranked 2; B is 3, 
and so on, finally giving us the second column. 



        Order of merit    Test
Boy          rank         rank        d         d²
 A             1            1         0          0
 B             2            3        -1          1
 C             3           15       -12        144
 D             4            4         0          0
 E             5            2         3          9
 F             6            5         1          1
 G             7           10        -3          9
 H             8           12        -4         16
 I             9            6         3          9
 J            10            8½        1½         2¼
 K            11           11         0          0
 L            12           14        -2          4
 M            13            7         6         36
 N            14            8½        5½        30¼
 O            15           13         2          4
                                    ----      -----
                                     22       265½
                                    -22
                                    ----
                                      0

$$\rho = 1 - \frac{6 \times 265.5}{15 \times 224} = 1 - 0.474 = 0.526.$$
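The arithmetic of Example 23 can be checked with the short sketch below. It assumes that formula (18), which is not reproduced in this section, has the form $\rho = 1 - 6\Sigma(d^2)/N(N^2-1)$; this form agrees with the figures substituted above. The variable names are illustrative only.

```python
# Ranks of boys A to O, copied from the table of Example 23.
order_rank = list(range(1, 16))
test_rank = [1, 3, 15, 4, 2, 5, 10, 12, 6, 8.5, 11, 14, 7, 8.5, 13]

# Column d: second column subtracted from the first.
d = [a - b for a, b in zip(order_rank, test_rank)]
sum_d = sum(d)                      # check on the signs: should be zero
sum_d2 = sum(x * x for x in d)      # total of the d-squared column

n = len(d)
# Assumed form of formula (18).
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))
print(sum_d, sum_d2, round(rho, 3))
```

The check that the d column totals zero is worth keeping in any such calculation, since a wrong sign is the commonest slip.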


7.ii. Biserial correlation. Observational data sometimes
make it impossible to calculate either product-moment or
ranked correlation coefficients. For example, a test may be
applied to a group of subjects about whom the only other
information we possess is that each of them has either passed or
failed a particular examination. In such a case, after making
two assumptions, we may calculate a form of correlation
coefficient known as the coefficient of biserial correlation. This
we shall denote bis. r. The two necessary assumptions are:

(1) that the dichotomous variable is normally distributed,
and

(2) that the regression of X on Y is linear (see Chapter VIII).

If both these assumptions are deemed justifiable and if we have
more than 80 observations, so that the data may be grouped,
the bis. r coefficient may be calculated as follows.

Suppose X is the numerical variable, i.e. the test, and Y the
dichotomous variable, i.e. the variable divided into only two
parts. Choose a convenient group unit for X and divide the
observations into the appropriate groups. Then construct a
two-row, or biserial, table showing how the subjects who pass
or fail in Y fall into the X-groups. Such a table might appear as
under:

                             Test X

         -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   Total

Pass      0   1   0   4   5   9   9  10   6   3   4   2   2     55

Fail      1   0   2   6   7   9  11   4   2   2   0   1   0     45

Total     1   1   2  10  12  18  20  14   8   5   4   3   2    100


As in calculating the standard deviation, we choose an arbitrary
origin for X and number off the groups on both sides, as
in Section 2.iii. This has been done in the table above.

Now let us call the portion of the x’s which fall into the larger
part of the Y distribution, $x_1$. Then the mean of this row (which
will be the Passes in the above table) will be $\bar{x}_1$. The mean
of all the x’s will be $\bar{x}$ and their standard deviation
$\sigma_x$. All these statistics may be calculated, in working
units, from the table. The corresponding statistics for Y cannot
be calculated directly, but assuming that we knew them, the
coefficient of biserial correlation would be given by the formula

$$\text{bis. } r = \frac{(\bar{x}_1 - \bar{x})/\sigma_x}{(\bar{y}_1 - \bar{y})/\sigma_y}. \qquad (19)$$




Although we cannot calculate $(\bar{y}_1 - \bar{y})/\sigma_y$
directly from the table, we may obtain it indirectly (on the above
assumptions) from data given in Table II of the Tables for
Statisticians and Biometricians (Ref. 3). This table gives a
quantity $\tfrac{1}{2}(1+\alpha)$* and the values of a function z
corresponding to it. (This z is not to be confused with the z used
by Fisher as a transformation for r, as in Section 6.viii.) The
quantity $\tfrac{1}{2}(1+\alpha)$ in our case is equal to $n_1/N$,
where $n_1$ is the number of observations in the larger portion of
Y and N is the total number of observations in the whole table.
This is readily calculated and the value of z corresponding to it
may be found by interpolation. (The method of doing this may be
best understood from Example 24.) We then have

$$\frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{z}{n_1/N}.$$

The coefficient of biserial correlation is therefore obtained by
calculating $(\bar{x}_1 - \bar{x})/\sigma_x$ from the two-row table
and dividing this by the value of $(\bar{y}_1 - \bar{y})/\sigma_y$
obtained from the statistical tables.

The sign of bis. r has to be determined by inspection. If, for
instance, the mean x score of the Passes is greater than that of
the Failures, then the correlation is positive.

This method of correlation should not be resorted to if it is
possible to avoid it. In most cases it gives little more
information than would be acquired by investigating the
significance of the difference between the mean x-scores of the
two Y-groups. In any case, the ordinary standard error of r does
not apply to bis. r, and caution is needed in interpreting the
coefficient. (The standard error of bis. r is known only to a
first approximation and the student who wishes to look further
into this matter is referred to H. E. Soper, Biometrika, Vol. x,
p. 384, 1914.)

The method of calculation is shown in the example below.


* If a vertical is drawn at any point x in a normal curve, the total
area is divided into two unequal portions, if x deviates from the mean.
$\tfrac{1}{2}(1+\alpha)$ is the area of the greater portion.

Example 24. Calculate the coefficient of biserial correlation
for the two-row table given above.

The total row at the foot of the table gives the grouped
frequency distribution of x, so that $\bar{x}$ and $\sigma_x$ may be
calculated by the method of Section 3.viii.


   f      x      fx     fx²
   1     -5     -5      25
   1     -4     -4      16
   2     -3     -6      18
  10     -2    -20      40
  12     -1    -12      12
  18      0      0       0
  20      1     20      20
  14      2     28      56
   8      3     24      72
   5      4     20      80
   4      5     20     100
   3      6     18     108
   2      7     14      98
 ----          ----    ----
 100           144     645
               -47
               ----
                97

$$\bar{x} = \frac{\Sigma(fx)}{N} = \frac{97}{100} = 0.97,$$

$$\sigma_x = \sqrt{\frac{645}{100} - 0.97^2} = 2.347.$$





The mean $\bar{x}_1$ is obtained in a similar manner by multiplying
x by the entries in the top row: these we may call $f_1$.

  f₁      x     f₁x
   0     -5      0
   1     -4     -4
   0     -3      0
   4     -2     -8
   5     -1     -5
   9      0      0
   9      1      9
  10      2     20
   6      3     18
   3      4     12
   4      5     20
   2      6     12
   2      7     14
 ----          ----
  55           105
               -17
               ----
                88

$$\bar{x}_1 = \frac{\Sigma(f_1 x)}{n_1} = \frac{88}{55} = 1.6.$$

From the biserial table, therefore,

$$\frac{\bar{x}_1 - \bar{x}}{\sigma_x} = \frac{1.6 - 0.97}{2.347} = 0.2684.$$




Now we need to calculate $(\bar{y}_1 - \bar{y})/\sigma_y$ from the
tables in Ref. 3.

$n_1 = 55$ and $N = 100$. Hence

$\tfrac{1}{2}(1+\alpha) = 55/100 = 0.55$.

From the table we find the following values of z corresponding
to two values of $\tfrac{1}{2}(1+\alpha)$:

for $\tfrac{1}{2}(1+\alpha) = 0.5477584$, $z = 0.3960802$,
for $\tfrac{1}{2}(1+\alpha) = 0.5517168$, $z = 0.3955854$.

Our value of $\tfrac{1}{2}(1+\alpha)$ lies between these two. Now
the difference between the two values of $\tfrac{1}{2}(1+\alpha)$
given above is 0.0039584. This corresponds to a difference in z of
0.0004948.

Our value of $\tfrac{1}{2}(1+\alpha)$ is 0.55 − 0.5477584 = 0.0022416
above the smaller of the two values given above. It will be noticed
that as $\tfrac{1}{2}(1+\alpha)$ gets larger, z gets smaller: hence
we have to subtract from the first z that part of the difference
between the two z’s which is proportional to 0.0022416/0.0039584,
i.e. the value of z corresponding to $\tfrac{1}{2}(1+\alpha) = 0.55$ is

0.3960802 − 0.0004948 × 0.0022416/0.0039584
= 0.3960802 − 0.0002802
= 0.3958000.*
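The interpolation step may be written as a small Python function. The function name `interpolate` is hypothetical; the four tabulated figures are those quoted in the example, and the method is plain linear interpolation between two neighbouring table entries.

```python
def interpolate(x0, y0, x1, y1, x):
    """Linear interpolation: the value of y at x, given two
    tabulated points (x0, y0) and (x1, y1) on either side of it."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Tabulated values of the area and of z quoted in Example 24.
z = interpolate(0.5477584, 0.3960802, 0.5517168, 0.3955854, 0.55)
print(round(z, 7))
```

Because z falls as the area rises, the term `(y1 - y0)` is negative and the correction is subtracted automatically, as in the hand calculation.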


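Hence

$$\frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{z}{n_1/N} = \frac{0.3958}{0.55} = 0.7196.$$

Therefore

$$\text{bis. } r = \frac{(\bar{x}_1 - \bar{x})/\sigma_x}{(\bar{y}_1 - \bar{y})/\sigma_y} = \frac{0.2684}{0.7196} = 0.373.$$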


This coefficient must be positive since $\bar{x}_1$, the mean test
score of the Passes, is larger than $\bar{x}$, the mean of all the
subjects, and so must be even larger than the mean of the Failures
alone. Hence the Passes have a better score than the Failures and
the relationship between the test scores and the examinational
success is a positive one.

* This method of linear interpolation is not strictly applicable to
the table but is sufficiently approximate for the present purpose.
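The whole of Example 24 may be sketched in Python as a check on the arithmetic. One assumption is made loudly here: the z of the statistical tables is taken to be the ordinate of the standard normal curve at the point which cuts off an area of $n_1/N$, computed with the standard library's `statistics.NormalDist` instead of by interpolation. This reproduces the tabulated value 0.3958, but it is an assumption of the sketch and not a statement made in the text.

```python
from math import sqrt
from statistics import NormalDist

# Two-row (biserial) table of Example 24, in working units of x.
x = list(range(-5, 8))
passes = [0, 1, 0, 4, 5, 9, 9, 10, 6, 3, 4, 2, 2]
fails = [1, 0, 2, 6, 7, 9, 11, 4, 2, 2, 0, 1, 0]
total = [p + f for p, f in zip(passes, fails)]

N = sum(total)     # all observations
n1 = sum(passes)   # the larger portion of Y

mean_all = sum(f * v for f, v in zip(total, x)) / N
sd_all = sqrt(sum(f * v * v for f, v in zip(total, x)) / N - mean_all ** 2)
mean_pass = sum(f * v for f, v in zip(passes, x)) / n1

# Assumption: z is the normal ordinate at the point cutting off area n1/N.
p = n1 / N
std_normal = NormalDist()
z = std_normal.pdf(std_normal.inv_cdf(p))

bis_r = ((mean_pass - mean_all) / sd_all) / (z / p)
print(round(mean_all, 2), round(sd_all, 3), round(mean_pass, 2), round(bis_r, 3))
```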

7.iii. Fourfold correlation. When both variables are
dichotomous the methods previously described cannot be
applied. There are various methods of calculating fourfold or
tetrachoric correlation coefficients, but their use is not advised.
The association between two dichotomous variables is best
investigated by the method described in Section 9.iii, p. 84.

EXERCISES ON CHAPTER VII 

22. (a) Using the ranking method (formula (18)), calculate the 
coefficient of ranked correlation between the order of merit H and 
tests C and D for the subjects 1-25. (Rank the smallest order of merit 
as 1 and the largest test score as 1.) 

(b) From the rankings of the subjects obtained above, calculate
the coefficient of ranked correlation between tests C and D for
subjects 1-25. Compare this with the value of r obtained in
Exercise 16(a).



Chapter VIII 

REGRESSION AND THE CORRELATION RATIO 


8.i. When we are considering the correlation between two
variables, X and Y, we may draw a graph showing the mean
values of y for regularly increasing values of x. This graph will
be an irregular line which is called the observed regression line
of y on x. Similarly the observed regression line of x on y is an
irregular line showing the mean values of x for regularly
increasing values of y. These lines indicate the law of change in
the mean of one variable for unit change in the other, and if the
lines are straight the regressions are said to be linear.

Usually, owing to errors of sampling, the observed regression
lines are rather irregular, but it may be possible to ‘fit’ straight
lines to them and to show mathematically that the observed
regression lines do not depart significantly from the fitted
straight regression lines.

A straight line has an algebraic equation which represents it
in symbols; hence linear regression lines may be represented by
the following equations:

$$(y - \bar{y}) = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}), \qquad (20)$$

$$(x - \bar{x}) = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}). \qquad (20\text{A})$$

The former is the regression straight line of y on x and the latter
the regression straight line of x on y. In these equations r is the
coefficient of correlation between X and Y.

The angles these lines make to the horizontal and vertical
respectively are measured by the expressions

$$r\frac{\sigma_y}{\sigma_x} \quad \text{and} \quad r\frac{\sigma_x}{\sigma_y},$$

and these are called the coefficients of regression.






If r = 1, the two regression equations are identical and the
regression straight lines coincide. If r = 0, the two lines are
horizontal and vertical respectively and cross at right angles.
The lines cross for any intermediate value of r, so that the
larger the value of r, the smaller is the acute angle between
them. The two lines always cross at the point $\bar{x}$, $\bar{y}$
on the graph, i.e. the point indicating where the means of x and
y lie.

In any fairly small sample the successive observed means of
one variable for different values of the other are unlikely to fall
exactly on a straight line, but if the regression does not depart
significantly from linearity it is possible to draw a straight line
which passes very nearly through the observed means. As an
example, the regression straight lines of the data in Example
17, p. 48, will be drawn and the discrepancies between the
observed regression lines and these may be observed.

At the right-hand side of the correlation table in that
example we find two columns headed $f_y$ and $T_{xy}$ respectively.
If we divide the entries in the second of these by the corresponding
entries in the first, we obtain the mean values of x corresponding
to each y-group. In Example 17 these are as follows:


   y      f_y     T_xy     T_xy/f_y
  -5       1       -6       -6.00
  -4       1       -3       -3.00
  -3       2       -6       -3.00
  -2      10      -10       -1.00
  -1      12       -8       -0.67
   0      20      -16       -0.80
   1      22      -10       -0.45
   2      12        3        0.25
   3       8        8        1.00
   4       5        6        1.20
   5       4       11        2.75
   6       3       10        3.33

The last column gives the observed line of regression of x on y.
In like manner the observed line of regression of y on x may be
obtained by calculating a $T_{yx}$ column from the correlation table
and dividing the entries in it by the corresponding entries in the
$f_x$ column. This gives us for the observed regression of y
on x:


   x      f_x     T_yx     T_yx/f_x
  -6       1       -5       -5.00
  -5       0        —         —*
  -4       3       -6       -2.00
  -3       7       -3       -0.43
  -2       8        2        0.25
  -1      21        0        0.00
   0      30        8        0.27
   1      15       30        2.00
   2       9       29        3.22
   3       3       10        3.33
   4       2       10        5.00
   5       1        6        6.00


* Note that there are no observations in the x = −5 column. There
will therefore be no entry in the last column for this group. Care must
be taken not to record the entry as 0.00, as this would be taken as a
point on the observed regression line.
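The division of one column by another, with the empty array left out and not recorded as zero, may be sketched as follows. The figures are those of the observed regression of y on x given above; the variable names are illustrative only.

```python
# x-groups with their frequencies and the totals of y in each array.
x_groups = list(range(-6, 6))
f_x = [1, 0, 3, 7, 8, 21, 30, 15, 9, 3, 2, 1]
T_yx = [-5, 0, -6, -3, 2, 0, 8, 30, 29, 10, 10, 6]

# Mean of y for each x-group. An array with no observations gives no
# point at all: it is skipped, never entered as 0.00.
observed = [(x, T / f) for x, f, T in zip(x_groups, f_x, T_yx) if f > 0]

for x, mean_y in observed:
    print(x, round(mean_y, 2))
```

Skipping the empty group, instead of storing a zero for it, keeps a spurious point off the graph of the observed regression line.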

We now proceed to plot these points on a graph. Since we are
working from arbitrary origins in this example, the two axes
will be at right angles, crossing at the point x = 0, y = 0. The
x line will be horizontal and we mark off along it equal divisions
corresponding to the x-groups. Those to the right of the origin,
or point where the axes cross, will be positive and those to the
left negative, and they are numbered accordingly. Similarly in
the case of y, divisions above the origin are positive and those
below negative.

In Fig. 4 the points on the observed regression lines are
plotted, those for the regression of x on y being marked by
crosses and those for the regression of y on x by circles. The
regression straight lines themselves, AA and BB, are also
drawn. It will be seen that the crosses mostly lie closely about
the line AA and the circles about the line BB. These two lines
cross at the point where x = −0.21 and y = 0.81, which were
the mean values of x and y in working units found in Example
17. The acute angle between the lines indicates a fairly high
correlation between x and y; it was found in fact that
r = 0.666.

The lines AA and BB are drawn as follows. The regression
coefficients are first calculated from the data r = 0.666,
$\sigma_x = 1.818$ and $\sigma_y = 2.181$.

Hence $r\dfrac{\sigma_x}{\sigma_y} = 0.666 \times 1.818/2.181 = 0.5551$

and $r\dfrac{\sigma_y}{\sigma_x} = 0.666 \times 2.181/1.818 = 0.7990$.

Substituting these in the regression equations, (20) and (20A),
we get

$(y - \bar{y}) = 0.7990(x - \bar{x})$,
and $(x - \bar{x}) = 0.5551(y - \bar{y})$.

Now we know that $\bar{x} = -0.21$ and $\bar{y} = 0.81$.
Hence we have

$y - 0.81 = 0.7990(x + 0.21)$
and $x + 0.21 = 0.5551(y - 0.81)$,
whence $y = 0.7990x + 0.9778$
and $x = 0.5551y - 0.6596$.

By substituting different values of x and y in these equations
and calculating the corresponding values of y and x, the fitted
regression lines may be plotted exactly. Two points are enough
to give each line.

In the first equation, for example,

if x = −5, y = −3.0172
and if x = 5, y = 4.9728.

These two points fix the line BB.

Similarly the line AA is obtained from the second equation:

if y = −5, x = −3.4351
and if y = 5, x = 2.1159.

8.ii. Since the method of product-moment correlation is
usually appropriate only in those cases where the regressions
of the two variables on each other are linear, it is important to
be able to ascertain whether in fact the regressions are linear in
any particular case. Often the graphic method exemplified
above suffices, but in cases of doubt (and always more
satisfactorily) the linearity of regression may be tested
mathematically. The process involves the only example of what is
called the ‘Analysis of Variance’ which may be dealt with in
this book. Students who wish for a fuller understanding of the
underlying theory are referred to Fisher’s book (Ref. 2).

Essentially the method is as follows. The student will
remember that the variance of x is obtained from the sum of the
squares of the deviations of each individual x from the general
mean, $\bar{x}$. This sum of squares may be split into two portions:

(1) the sum of the squares of the deviations of each x from
the mean of its own array, summed for all arrays, and

(2) the sum of the squares of the deviations of the means of
the arrays from the general mean, $\bar{x}$. In the latter each array
mean must be weighted by the number of observations in it.

This splitting of the variance may be shown symbolically.
Let $f_x$ be the number of observations in a y-array corresponding
to a particular value of x, and $\bar{y}_x$ be the mean of the y’s in
this array. It may then be shown that

$$S(y - \bar{y})^2 = S(y - \bar{y}_x)^2 + \Sigma[f_x(\bar{y}_x - \bar{y})^2],$$

where $\Sigma$ means a summation for all different values of x.

Now if there is a linear regression of y on x, there will be a
mean y for each array which may be calculated from the
regression equation. The quantity $\Sigma[f_x(\bar{y}_x - \bar{y})^2]$
may therefore be subdivided into two portions, one of which is the
sum of squares due to linear regression and the other the weighted
sum of the squares of the deviations of the array means from the
means calculated on the assumption of linear regression. If
this latter portion of the variance is sufficiently large, it
signifies that the means of the arrays differ significantly from
linear regression, and the following section shows how the
significance of this departure from linearity may be tested
arithmetically.
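The splitting of the sum of squares into a portion within arrays and a portion between arrays may be verified numerically. The sketch below uses a small set of hypothetical figures, invented purely for illustration and not taken from any example in the book.

```python
# Hypothetical toy data: each key is a value of x, and each list
# holds the y's falling in that array.
arrays = {1: [2.0, 4.0], 2: [3.0, 5.0, 7.0], 3: [8.0]}

all_y = [y for ys in arrays.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)

# Total sum of squares about the general mean.
total_ss = sum((y - grand_mean) ** 2 for y in all_y)

within_ss = 0.0    # deviations of each y from the mean of its own array
between_ss = 0.0   # deviations of array means from the general mean, weighted
for ys in arrays.values():
    array_mean = sum(ys) / len(ys)
    within_ss += sum((y - array_mean) ** 2 for y in ys)
    between_ss += len(ys) * (array_mean - grand_mean) ** 2

print(round(total_ss, 4), round(within_ss, 4), round(between_ss, 4))
```

Whatever figures are used, the first quantity printed equals the sum of the other two, apart from rounding in the last place.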


8.iii. In Example 17, ten columns were constructed on the
right of the correlation table. In order to examine the linearity
of the regression of y on x for the same data, we need to
construct three further columns. The first of these is the $T_{yx}$
column, as shown in Section 8.i. The second is obtained by
dividing entries in this first column by corresponding entries
in the $f_x$ column: this is headed $T_{yx}/f_x$. Entries in this
second column are then multiplied by corresponding entries in the
first to give a final column headed $(T_{yx})^2/f_x$.

From the sums of certain of these thirteen columns we have
to calculate three quantities:

(1) $\Sigma[f_x(\bar{y}_x - \bar{y})^2]$,

(2) $S(y - \bar{y})^2$,

(3) $\dfrac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2}$.

These can be calculated as follows:

(1) $\Sigma[f_x(\bar{y}_x - \bar{y})^2] = \Sigma\left[\dfrac{(T_{yx})^2}{f_x}\right] - \dfrac{[\Sigma(f_y y)]^2}{N}$,

(2) $S(y - \bar{y})^2 = \Sigma(f_y y^2) - \dfrac{[\Sigma(f_y y)]^2}{N}$,

(3) $\dfrac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = \dfrac{\left[\Sigma(y\,T_{xy}) - \dfrac{\Sigma(f_x x)\,\Sigma(f_y y)}{N}\right]^2}{\Sigma(f_x x^2) - \dfrac{[\Sigma(f_x x)]^2}{N}}$.

From these data we construct a table. In computing the
number of degrees of freedom in each part of the variance, a
is the number of y-arrays.


                                                  Degrees of    Mean
                                                  freedom       square

A. Total sum of squares: $S(y - \bar{y})^2$          N − 1

B. Total sum of squares within arrays:
   (A − C)                                           N − a      (A − C)/(N − a)

C. Total sum of squares between arrays:
   $\Sigma[f_x(\bar{y}_x - \bar{y})^2]$              a − 1

D. Sum of squares due to deviations from
   linear regression: (C − E)                        a − 2      (C − E)/(a − 2)

E. Sum of squares due to linear regression:
   $[S(x - \bar{x})(y - \bar{y})]^2 / S(x - \bar{x})^2$    1

In this table entries B and D are obtained by subtraction as 
indicated, B by subtracting C from A, and D by subtracting 
E from C. The entries in B and D are then divided by their 
respective degrees of freedom, giving the two entries in the 
column headed ‘Mean square’. 

We next find the logarithms of the entries in the mean
square column and multiply the difference between these
logarithms by 1.1513 (to convert logarithms to the base 10 to
Napierian logarithms and to divide by 2).* This gives us a
quantity called z, which is tabulated in Table VI of Fisher’s
book. (Unfortunately this is the third z we have had to use and
should not be confused with those of Sections 6.ix and
7.ii.)

* See next section.

Hence

$$z = 1.1513\left|\log\frac{A - C}{N - a} - \log\frac{C - E}{a - 2}\right|,$$

the difference being taken as positive.

It finally remains to look up the value for z in Fisher’s table
corresponding to the number of degrees of freedom we have.
There are two values of n required, $n_1$ and $n_2$. The $n_1$ of the
table is the number of degrees of freedom corresponding to the
larger mean square, i.e. whichever of the entries B or D is the
bigger, and $n_2$ is the number of degrees of freedom
corresponding to the smaller mean square.

If the calculated z is smaller than the z in Fisher’s 5% point 
table, then the regression of y on x does not differ significantly 
from linearity. 

The whole of the necessary calculation is shown in 
Example 25. It must be remembered that this is an 
examination of the regression of y on x only: the regression of 
x on y should also be examined by an exactly similar method, 
interchanging x and y in each formula. 

Example 25. Examine the linearity of the regression of y on 
x for the correlation table given in Example 17. 

For the sake of space the correlation table itself is not
reproduced here. The first ten columns to the right of the table
are copied from Example 17, p. 48, and were obtained as
explained in the section preceding that example.

From the calculations on page 74 we may construct the 
analysis table. The number of arrays, a, = 12. 


                                                        Degrees of   Mean
                                                        freedom      square

A. Total sum of squares                     475.39         99
B. Total sum of squares within arrays       227.3033       88        2.5830
C. Total sum of squares between arrays      248.0867       11
D. Sum of squares due to deviations
   from linear regression                    37.2467       10        3.7247
E. Sum of squares due to linear
   regression                               210.84

z = 1.1513 (log 3.7247 − log 2.5830) = 0.1830.
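The construction of the analysis table from the three calculated quantities may be sketched as below. The names are illustrative; the three sums of squares A, C and E are those of Example 25, and the logarithms are to the base 10 as in the text.

```python
from math import log10

N, a = 100, 12                        # observations and number of arrays
A, C, E = 475.39, 248.0867, 210.84    # total, between arrays, linear regression

B = A - C                             # within arrays, by subtraction
D = C - E                             # deviations from linear regression

ms_within = B / (N - a)               # mean square for row B
ms_dev = D / (a - 2)                  # mean square for row D

# z: difference of the logarithms, taken as positive, times 1.1513.
z = 1.1513 * abs(log10(ms_dev) - log10(ms_within))
print(round(B, 4), round(D, 4), round(ms_within, 4), round(ms_dev, 4), round(z, 3))
```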





[Calculation table for Example 25 (page 74 of the original): the
ten columns copied from Example 17 together with the three further
columns $T_{yx}$, $T_{yx}/f_x$ and $(T_{yx})^2/f_x$. The entries of
the table are not recoverable here.]

$$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = 313.6967 - \frac{81^2}{100} = 248.0867.$$
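The sum of squares between arrays may be recomputed from the $f_x$ and $T_{yx}$ columns given in Section 8.i. This is a sketch for checking the arithmetic only; the variable names are illustrative.

```python
# f_x and T_yx columns for x = -6 to x = 5, from Section 8.i.
f_x = [1, 0, 3, 7, 8, 21, 30, 15, 9, 3, 2, 1]
T_yx = [-5, 0, -6, -3, 2, 0, 8, 30, 29, 10, 10, 6]

N = sum(f_x)

# Total of the (T_yx)^2 / f_x column; the empty array contributes nothing.
sum_T2_over_f = sum(T * T / f for T, f in zip(T_yx, f_x) if f > 0)

# Correction for the general mean: square of the grand total of y over N.
correction = sum(T_yx) ** 2 / N

between = sum_T2_over_f - correction
print(round(sum_T2_over_f, 4), round(correction, 2), round(between, 4))
```

The result agrees with the figure 248.0867 of the text to within one unit in the fourth decimal place, the small difference arising from rounding in the hand calculation.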
We must now consult Fisher’s 5% point table of z. We have
$n_1 = 10$ and $n_2 = 88$.

Now we see from the table that when $n_1 = 12$ and $n_2 = 120$,
the 5% point is 0.3032, hence the value for $n_1 = 10$ and $n_2 = 88$
must be bigger than this. Accordingly our calculated value of
z is much smaller than the 5% point and we conclude that the
regression of y on x does not depart significantly from linearity.

8.iv. The method of calculating z at the end of the previous
section was included for the sake of theoretical completeness.
In actual practice the rather cumbrous calculation may be
avoided by the use of a table given in Fisher and Yates’ tables
(Ref. 5). Instead of working out the value of z we may calculate
the variance ratio. In testing the linearity of regression by this
method, what we wish to show is whether or not the variance
due to deviations from linearity of regression is significantly
greater than that within arrays. If the former variance is
smaller than the latter, there is no need to make a test at all:
there is no evidence of departure from linearity of regression.
(The student need not concern himself with the very rare case
where the variance due to deviations from linear regression is
significantly smaller than the variance within arrays.)

The variance ratio in this case is the entry in the ‘Mean
square’ column in the D row divided by that in the B row. If
this ratio is greater than 1 we must consult Table V for the
5% point in the Fisher and Yates tables. For this purpose $n_1$
in the table is the number of degrees of freedom corresponding
to the larger variance, i.e. that in the D row. If the calculated
ratio is smaller than the appropriate entry in the table, then
the regression does not depart significantly from linearity.

Taking the data of Example 25, for instance, the variance
ratio is 3.7247/2.5830 = 1.44. In this case $n_1 = 10$ and $n_2 = 88$.
In Table V of Fisher and Yates for $n_1 = 12$ and $n_2 = 120$ we
find an entry of 1.83 for the 5% point. Our calculated variance
ratio is definitely smaller than this and so would be relatively
even smaller than the one corresponding to the number of
degrees of freedom in our data. (The correct 5% point ratio could
be obtained from the table by interpolation, but there is no
need for this.) We conclude, therefore, that the regression does
not depart significantly from linearity, which agrees with our
findings in the previous section.
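The relation between the variance ratio and the z of the previous section may be shown in a few lines of Python. The names are illustrative; the two mean squares are those of Example 25 and the figure 1.83 is the tabulated 5% point quoted above.

```python
from math import log

ms_dev, ms_within = 3.7247, 2.5830   # rows D and B of the analysis table

ratio = ms_dev / ms_within           # the variance ratio
z = 0.5 * log(ratio)                 # half the Napierian logarithm of the ratio

five_per_cent_point = 1.83           # tabulated entry for n1 = 12, n2 = 120
print(round(ratio, 2), round(z, 3), ratio < five_per_cent_point)
```

Since z is simply half the natural logarithm of the variance ratio, the two tests must always lead to the same conclusion.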


8.v. The correlation ratio. If the regressions of the two
variables on each other are non-linear, the degree of association
between the variables cannot be measured by ordinary
correlation. There might be a real relationship and yet the
coefficient of correlation would be zero if the regressions were
semicircular and symmetrical, for example. In such cases a measure
of association may be obtained by calculating the correlation
ratio.

There are two correlation ratios for each pair of variables and
they are denoted by $\eta_{yx}$ and $\eta_{xy}$. The formulae for
these are simple:

$$\eta_{yx}^2 = \frac{\sigma_{\bar{y}_x}^2}{\sigma_y^2} = \frac{\Sigma[f_x(\bar{y}_x - \bar{y})^2]}{S(y - \bar{y})^2}, \qquad (21)$$

$$\eta_{xy}^2 = \frac{\sigma_{\bar{x}_y}^2}{\sigma_x^2} = \frac{\Sigma[f_y(\bar{x}_y - \bar{x})^2]}{S(x - \bar{x})^2}. \qquad (21\text{A})$$

In these formulae $\sigma_{\bar{x}_y}$ signifies the standard deviation
of the means of the x-arrays and $\sigma_{\bar{y}_x}$ the standard
deviation of the means of the y-arrays, so that $\eta$ may be seen to
be the ratio between the standard deviation of the means of arrays
and the standard deviation of the whole sample.

It will be seen that the requisite data for the calculation of
$\eta_{yx}$ have already been obtained in Example 25, and if the
student has examined the linearity of the regression of x on
y he will also have the data for calculating $\eta_{xy}$.


Example 26. Calculate $\eta_{yx}$ from the data obtained in
Example 25.

From that example we see that

$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = 248.0867$ and $S(y - \bar{y})^2 = 475.39$.

Hence

$$\eta_{yx}^2 = \frac{248.0867}{475.39} = 0.52186,$$

whence

$$\eta_{yx} = \sqrt{0.52186} = 0.7224.$$

8.vi. The data necessary for the examination of the linearity
of regression may now be expressed in an alternative form. We
have

$$S(y - \bar{y})^2 = N\sigma_y^2,$$

$$\Sigma[f_x(\bar{y}_x - \bar{y})^2] = N\eta_{yx}^2\sigma_y^2,$$

$$\frac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = Nr^2\sigma_y^2.$$

Also the z of Section 8.iii may be expressed as

$$z = 1.1513\log\frac{(N - a)(\eta^2 - r^2)}{(a - 2)(1 - \eta^2)}.$$

If we substitute in this expression the values we have obtained,
we find that

$$z = 1.1513\log\frac{88\,(0.7224^2 - 0.666^2)}{10\,(1 - 0.7224^2)}$$

$$= 1.1513\log\frac{6.89075}{4.7814}$$

$$= 1.1513\log(1.4412)$$

$$= 0.183.$$

This result is identical with the value of z calculated in
Example 25.
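The alternative expression for z may be checked with the following sketch. The names are illustrative; the logarithm is to the base 10, and the values of $\eta$ and r are those obtained in Examples 26 and 17.

```python
from math import log10

N, a = 100, 12           # observations and number of arrays
eta, r = 0.7224, 0.666   # correlation ratio and coefficient of correlation

# Ratio of the mean square for deviations from linear regression
# to the mean square within arrays, expressed through eta and r.
ratio = ((N - a) * (eta ** 2 - r ** 2)) / ((a - 2) * (1 - eta ** 2))

z = 1.1513 * log10(ratio)
print(round(ratio, 4), round(z, 3))
```

The ratio printed is the variance ratio of Section 8.iv, so that the correlation ratio and the coefficient of correlation together contain all that is needed for the test of linearity.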

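It may also be observed that when $\eta$ and r are equal in the
above expression, the quantity $\eta^2 - r^2$, and with it the sum of
squares due to deviations from linear regression, is zero, i.e.
when the regressions are absolutely linear, both $\eta$’s are equal
and equal to r.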


EXERCISES ON CHAPTER VIII 

23. (a) From the table used in Exercise 18, examine the linearity of
the regressions of tests F and G on each other.

(b) Thence calculate the two correlation ratios for tests F and G.



Chapter IX 

χ²: CONTINGENCY: GOODNESS OF FIT

9.i. The statistical methods already described have mostly
been applicable to quantitative numerical data only, and
usually only to data which are at any rate approximately
normally distributed. It may happen, however, that the
available data are qualitative or quantitative only in the sense
that we know the number of cases falling into different
categories. In such instances the methods which have been
explained in the previous chapters cannot be used but there are
other methods which are appropriate.

These methods depend chiefly on a statistic known as $\chi^2$.
The mathematical derivation of this statistic is difficult and
cannot be described here. However, the distribution of it has
been worked out and tables are available showing the frequency
with which different values of $\chi^2$ are exceeded and also the
values of $\chi^2$ corresponding to particular frequencies. Reference
to the appropriate tables will be made later in this chapter.

Use may be made of $\chi^2$ in the investigation of a number of
different problems, but the calculation of it is essentially the
same in each case. If O is the observed frequency in a particular
category into which a variable may fall and E is the frequency
which would be expected to fall in that category on some
hypothesis, then $\chi^2$ may be found by dividing the square of the
difference between O and E by E and summing these quotients
for all categories into which the variable falls. In symbols,

$$\chi^2 = \Sigma\left[\frac{(O - E)^2}{E}\right]. \qquad (22)$$

This is the general formula and the calculation of it in specific
cases will be described in due course.
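The general formula may be sketched as a short Python function. The function name `chi_square` and the frequencies used are hypothetical, chosen only to show the arithmetic; they do not come from any example in the book.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequencies in three categories, with equal expectation.
observed = [18, 22, 20]
expected = [20, 20, 20]
print(round(chi_square(observed, expected), 4))
```

Each category contributes the square of its discrepancy divided by its expected frequency, so that a category in which O and E agree contributes nothing.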

Having found $\chi^2$ we need to know the number of degrees of
freedom available for calculating it in each particular case
before we can make use of the $\chi^2$ tables. Rules for finding the
number of degrees of freedom will be given in each instance.
Consultation of the tables will then indicate the probability
of a calculated value of $\chi^2$ being exceeded as a result of
random sampling. If this probability is less than 0.05 (i.e.
19:1), then $\chi^2$ may be regarded as showing that the observed
data depart significantly from the hypothesis which is being
examined. This hypothesis may be that a variable has a
particular type of distribution or, more frequently, that there is
no association between two variables. This latter hypothesis,
known as a null hypothesis, assumes that two variables are not
associated: if we can disprove a null hypothesis, it follows that
the two variables must be associated. Such a procedure is usual
in the investigation of certain problems of association which
will now be described.

9.ii. Contingency. Suppose x and y are two variables 
which are not measurable numerically but which can each be 
divided into two or more categories. An example of this might 
be hair colour and eye colour. Let the different categories of x 
be $x_1$, $x_2$, $x_3$, etc., and those of y be $y_1$, $y_2$, $y_3$, etc. 
We may then construct a contingency table showing the number 
of $x_1$'s which fall into the $y_1$, $y_2$, $y_3$, etc. categories, and so on. 
This will give us a sort of small correlation table. An example 
is shown below for 5 x-categories and 4 y-categories. 



            $x_1$       $x_2$       $x_3$       $x_4$       $x_5$

$y_1$                                                                   $n_{y_1}$

$y_2$                                                                   $n_{y_2}$

$y_3$                                                                   $n_{y_3}$

$y_4$                                                                   $n_{y_4}$

            $n_{x_1}$   $n_{x_2}$   $n_{x_3}$   $n_{x_4}$   $n_{x_5}$   N


It will be seen that the table is divided into 5 × 4 = 20 
rectangles or cells. The total frequency in each x-category is given 
at the foot of the columns and will be $n_{x_1}$, $n_{x_2}$, etc. Similarly the 
total frequency in each y-category is given at the end of the 
horizontal rows and will be $n_{y_1}$, $n_{y_2}$, etc. The total frequency of 
each variable, or pairs of variables, will as usual be N. In 
general, if there are r rows and c columns, there will be r × c 
cells. Each cell may be referred to by the x and y categories 
into which it falls: e.g. the cell on the third row down and in the 
second column from the left may be referred to as the $x_2$, $y_3$ cell. 

The contingency table is completed by filling in the observed 
frequency in each cell. For instance, we count how many 
observations in the $x_1$ category are also in the $y_1$ category and 
write down the number in the $x_1$, $y_1$ cell, and so on. From the 
completed table we may calculate a coefficient known as the 
coefficient of mean square contingency and denoted by C. To do 
this we have first to calculate the frequency which would be 
expected in each cell on the null hypothesis, i.e. on the 
assumption that the two variables are not associated with one another. 
This is done quite simply as follows: if $n_{x_c}$ is the total frequency 
in the $x_c$ column, and $n_{y_r}$ the total frequency in the $y_r$ row, then 
the frequency that would be expected in the $x_c$, $y_r$ cell on the 
null hypothesis is 

$$\frac{n_{x_c} \times n_{y_r}}{N}.$$

By substituting each of the values of c and r in turn, we obtain 
the expected frequency in every cell. 

The next step is to subtract the expected frequency from the 
observed frequency in each cell: the resulting values of (O − E) 
are written in each cell with due regard to sign, and the 
arithmetic may be checked at this stage by observing that the total 
of (O − E) is zero for each row and each column. 

Next square (O − E) and divide the square by E for each cell. 
This gives a series of quotients which is conveniently written 
down on the right of the table: there will be r × c such quotients. 
Finally this column of figures is added, giving $\chi^2$. 
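
The steps just described (expected frequencies from the marginal totals, the check that the differences sum to zero, and the final summation) may be sketched in Python as follows. The function names and the 2 × 3 table are hypothetical, introduced purely for illustration.

```python
def expected_frequencies(table):
    """Expected cell frequencies on the null hypothesis of no association.

    `table` is a list of rows; each expected value is
    (row total x column total) / N.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def contingency_chi_square(table):
    """Return (chi-square, table of differences O - E)."""
    expected = expected_frequencies(table)
    diffs = [[o - e for o, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(table, expected)]
    chi2 = sum(d * d / e
               for diff_row, exp_row in zip(diffs, expected)
               for d, e in zip(diff_row, exp_row))
    return chi2, diffs


# Hypothetical 2 x 3 table of observed frequencies.
table = [[10, 20, 30],
         [20, 20, 20]]
chi2, diffs = contingency_chi_square(table)

# Arithmetic check: (O - E) sums to zero along every row and column.
assert all(abs(sum(row)) < 1e-9 for row in diffs)
assert all(abs(sum(col)) < 1e-9 for col in zip(*diffs))
print(round(chi2, 3))
```

Keeping the differences as a separate table mirrors the hand method, in which the row and column checks are made before any squaring is done.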

From this the coefficient of contingency, C, is given by the 
formula 

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}. \tag{23}$$

[Sometimes an intermediate step is inserted in the definition 
of C. $\chi^2$ is called the 'square contingency' and if we divide it 
by N we get the 'mean square contingency', which is denoted 
by $\phi^2$. Then 

$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}}.]$$

If the null hypothesis is correct, O and E will be equal for 
each cell (apart from errors of sampling), so that $\chi^2$ will be zero 
and C will also be zero. It is evident from formula (23) that C 
can never quite equal unity. The actual maximum value of C 
depends on the number of cells, so that for a 2 × 2 table, for 
example, the maximum possible value of C is 0.707, and for a 
10 × 10 table the maximum for C is 0.949. It is obvious, 
therefore, that a value of C obtained from one contingency table 
cannot be directly compared with that from another, unless 
the number of rows and columns in the one is equal to that in 
the other. Hence in reporting a value of C the number of cells 
in the table should always be mentioned. Moreover, C is always 
positive, so that the nature of the association has to be 
determined by inspection of the table. It may be seen, therefore, 
that C itself is not a very useful coefficient. Of much more use 
is the information to be derived from $\chi^2$. 
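
A short sketch may make the behaviour of C concrete. The closed form used for the upper limit in a k × k table, $\sqrt{(k-1)/k}$, is a standard result that is not stated in the text; it is assumed here because it reproduces the two limits quoted above. Both function names are hypothetical.

```python
import math


def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (N + chi2)), as in formula (23)."""
    return math.sqrt(chi2 / (n + chi2))


def max_contingency_coefficient(k):
    """Assumed upper limit of C for a k x k table: sqrt((k - 1) / k)."""
    return math.sqrt((k - 1) / k)


for k in (2, 10):
    print(k, round(max_contingency_coefficient(k), 3))
```

However large $\chi^2$ becomes, the ratio under the square root stays below unity, which is why C can never quite reach 1.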

The standard error of C is exceedingly complex and can only 
be interpreted for very large samples. Accordingly it is seldom 
used and instead we find from the $\chi^2$ table the corresponding 
value of P, which gives the probability of our calculated value 
of $\chi^2$ being exceeded as the result of random sampling. There 
are two usual methods of doing this, making use of a table 
given by Fisher or one due to Elderton. Fisher's method is 
probably the more convenient and will be described first. 

(1) First we need to know n, the number of degrees of 
freedom available for calculating $\chi^2$. In the case of a contingency 
table of r rows and c columns this is given by n = (r − 1)(c − 1). 
We may then consult Fisher's table, of which an extract is 
given below, noting the value of $\chi^2$ appearing against this value 
of n in the column headed P = 0.05. If the calculated value of 
$\chi^2$ is greater than that given in the table, then it is significant, 
and the null hypothesis is disproved, i.e. there is a significant 
association between the variables. 


An extract from Fisher's table (Ref. 2, Table III) or Fisher 
and Yates (Ref. 5, Table IV) for P = 0.05 is given below. 


Table VII. Table of $\chi^2$ for different values of n 
P = 0.05 

 n    $\chi^2$        n    $\chi^2$        n    $\chi^2$
 1     3.841        11    19.675        21    32.671
 2     5.991        12    21.026        22    33.924
 3     7.815        13    22.362        23    35.172
 4     9.488        14    23.685        24    36.415
 5    11.070        15    24.996        25    37.652
 6    12.592        16    26.296        26    38.885
 7    14.067        17    27.587        27    40.113
 8    15.507        18    28.869        28    41.337
 9    16.919        19    30.144        29    42.557
10    18.307        20    31.410        30    43.773


(2) To make use of Elderton's table, which is given in Tables 
for Statisticians and Biometricians (Ref. 3, Table XII), we need 
to know n′, which is one more than the number of degrees of 
freedom, i.e. n′ = (r − 1)(c − 1) + 1. Different values of n′ are 
given at the heads of columns in Elderton's table and integral 
values of $\chi^2$ listed at the side. By looking up the value of P 
in the appropriate n′ column opposite the calculated value of 
$\chi^2$, we obtain the probability of as great a value of $\chi^2$ or greater 
occurring as a result of random sampling; if this value of P is 
less than 0.05, $\chi^2$ is significant. Since the listed values of $\chi^2$ are 
integral, it is usually necessary to interpolate to find the value 
of P. 
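
Where neither table is to hand, the value of P may be computed directly. The following is a sketch under stated assumptions, not a method given in the text: it uses the standard finite series for the $\chi^2$ probability integral with a whole number of degrees of freedom (one form for even n, another for odd n), and the function name `chi_square_p_value` is hypothetical.

```python
import math


def chi_square_p_value(chi2, df):
    """Probability that chi-square with `df` degrees of freedom exceeds `chi2`.

    Valid for whole-number df >= 1 only.
    """
    if df < 1 or chi2 < 0:
        raise ValueError("need df >= 1 and chi2 >= 0")
    half = chi2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^r / r! for r = 0 .. df/2 - 1.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(chi2).
    root = math.sqrt(chi2)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= chi2 / (2 * r - 1)
        total += term
    tail = math.erfc(root / math.sqrt(2.0))
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


# The 5 per cent values for n = 1, 2 and 3 should each give P close to 0.05.
for chi2, df in [(3.841, 1), (5.991, 2), (7.815, 3)]:
    print(df, round(chi_square_p_value(chi2, df), 4))
```

Note that the argument is n, the number of degrees of freedom, and not Elderton's n′.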

The use of both tables is illustrated in Example 27. 

In using the method of contingency there are two provisions 
to be borne in mind. Firstly, E , the expected frequency, must 
be at least 5 in each cell; secondly, the table should if possible 
contain at least 5 rows and 5 columns. The former provision is 
the more important. 

Example 27. Two examiners assessed the intelligence of 200 
students, one by a verbal test and the other by a performance 
test. Each graded the intelligence as Very Good, Good, Fair 
or Poor. The relation between the two sets of judgments is 
shown in the contingency table below. Calculate the coefficient 
of contingency and examine whether the relationship between 
the two judgments can be regarded as significant or not. 


                        Examiner 2

Examiner 1      V.G.     G.      F.      P.

V.G.             19      10       8       3
G.                7      40       9       4
F.                8      20      23      19
P.                0       8      12      10


First we construct a contingency table, leaving room for 
three entries in each cell. Then in each cell we enter the 
observed frequency, the expected frequency and the difference 
between the two, thus: 


    O − E
    O
    E

                V.G.      G.        F.        P.       Total    (O − E)²/E

V.G.   O − E    12.2     −5.6      −2.4      −4.2        0       21.888
       O        19       10         8         3         40        2.010
       E         6.8     15.6      10.4       7.2       40        0.554
                                                                  2.450

G.     O − E    −3.2     16.6      −6.6      −6.8        0        1.004
       O         7       40         9         4         60       11.776
       E        10.2     23.4      15.6      10.8       60        2.792
                                                                  4.281

F.     O − E    −3.9     −7.3       4.8       6.4        0        1.278
       O         8       20        23        19         70        1.952
       E        11.9     27.3      18.2      12.6       70        1.266
                                                                  3.251

P.     O − E    −5.1     −3.7       4.2       4.6        0        5.100
       O         0        8        12        10         30        1.170
       E         5.1     11.7       7.8       5.4       30        2.262
                                                                  3.918

Total  O − E     0        0         0         0
       O        34       78        52        36        200       $\chi^2$ = 66.952
       E        34       78        52        36

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{66.952}{266.952}} = 0.501.$$

Along the bottom and at the side of the table the arithmetic 
is checked by showing that the total of the expected frequencies 
for each row and each column is the same as the total of the 
observed frequencies, and also that the sum of (O − E) is zero 
for each row and each column. Then for each cell we calculate 
(O − E)²/E and tabulate the values obtained on the right of the 
table. The sum of these is $\chi^2$. The remainder of the calculation 
of C is shown below the table. 

We shall assess the significance of this result by both 
methods. 

(1) Using Fisher's table: 

n = (4 − 1)(4 − 1) = 9. 

From Table VII, for n = 9, $\chi^2$ = 16.919. The calculated value 
of $\chi^2$ is much bigger than the value in the table, hence the 
distribution departs significantly from independence, i.e. there is 
a significant relation between the two sets of judgments. 

(2) Using Elderton's table: 

n′ = (4 − 1)(4 − 1) + 1 = 10. 

For n′ = 10, opposite $\chi^2$ = 60 we get a value of P = 0.000000. 
This means that a value of $\chi^2$ as big as or bigger than the one 
calculated would arise as a result of random sampling less than 
once in a million times. Accordingly there is no doubt about 
the significance of the relationship between the two sets of 
judgments. 

Note that Fisher's method is the easier for proving whether 
or not there is a significant relationship in a single table. If we 
wish to compare the significance in two or more tables we need 
to know the value of P for each; in this case we have to use 
Elderton's table or else the complete $\chi^2$ tables given in Fisher's 
book. 
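
The whole of Example 27 may be reproduced by a short Python sketch. The observed frequencies and the 5 per cent value 16.919 are those of the example; the variable names are hypothetical, and the result may differ from the hand calculation in the last decimal place because no intermediate rounding is done.

```python
import math

# Observed frequencies of Example 27, one list for each row of the table.
observed = [
    [19, 10, 8, 3],
    [7, 40, 9, 4],
    [8, 20, 23, 19],
    [0, 8, 12, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency in each cell on the null hypothesis.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
c = math.sqrt(chi2 / (n + chi2))

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 16.919            # Table VII, n = 9, P = 0.05
significant = chi2 > critical

print(f"chi2 = {chi2:.2f}, C = {c:.3f}, df = {df}, significant = {significant}")
```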

9.iii. 2 × 2 tables. A special form of contingency arises 
when both variables are dichotomous, i.e. each variable can be 
divided into only two classes. The association between such 
variables may be shown by constructing a contingency table 
with only four cells, since there will be only two columns and 
two rows. We shall now examine the significance of a fourfold 
or 2 × 2 table making use of the $\chi^2$ method. 

In order to simplify the notation we shall denote the 
observed frequencies in the four cells by a, b, c and d as under. 



            $x_1$      $x_2$

$y_1$         a          b          a + b

$y_2$         c          d          c + d

            a + c      b + d         N


It will be seen that N = a + b + c + d. 

The value of $\chi^2$ may be determined in the same way as that 
used for any contingency table, but the arithmetic may be 
shortened by using the formula 

$$\chi^2 = \frac{(ad - bc)^2 N}{(a + c)(b + d)(c + d)(a + b)}. \tag{24}$$

The denominator of this will be seen to be the product of the 
totals of the rows and columns. 

Having found $\chi^2$, its significance may be determined by 
consulting Fisher's table for n = 1. It will be seen that if the 
calculated value of $\chi^2$ is as great as or greater than 3.841, then there is 
a significant association between the two variables. 

Alternatively, the value of P may be found by consulting a 
special table given by Yule (Ref. 1, p. 534). Reference to this 
table will show that for a value of $\chi^2$ equal to 3.84, P = 0.05. 
This, of course, agrees with Fisher's table, but if exact values of 
P are required for purposes of comparison they may be found 
from Yule's table. 

As in the case of all contingency tables, the nature of the 
association in a 2 × 2 table has to be determined by inspection. 


Example 28. Is there a significant association between X 
and Y in the following data? 

X 

From formula (24), 

$$\chi^2 = \frac{(65 \times 50 - 25 \times 20)^2 \times 160}{85 \times 75 \times 70 \times 90}.$$

This may readily be evaluated by logarithms and we find that 

$$\chi^2 = 30.13.$$

This value is much greater than 3.84, the critical significance 
value of $\chi^2$ for one degree of freedom, so that we may conclude 
that there is a significant relationship between X and Y. 
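
Formula (24) is easily checked by machine. In the sketch below the four frequencies are those substituted into the formula in Example 28; which of the two off-diagonal cells holds 25 and which holds 20 is an assumption, but it does not affect the result, since the formula is unchanged when b and c are interchanged. The function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a fourfold table by the short formula (24)."""
    n = a + b + c + d
    numerator = (a * d - b * c) ** 2 * n
    denominator = (a + c) * (b + d) * (c + d) * (a + b)
    return numerator / denominator


chi2 = chi_square_2x2(65, 25, 20, 50)
print(round(chi2, 2), chi2 > 3.841)
```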

9.iv. 2 × n tables. Cases frequently occur where we wish 
to examine the association between two variables, one of 
which is dichotomous whilst the other is capable of division 
into several, say n, classes. In such cases $\chi^2$ may be calculated 
by the method of Section 9.ii, but the calculation of the 
expected frequencies may be avoided in the following manner. 
Suppose the 2 × n table is represented as under, y being the 
dichotomous variable and x the variable with several categories 
(four in the example). The observed frequencies in the cells are 
denoted in each case by the letter a with appropriate suffixes 
and dashes. 



            $x_1$          $x_2$          $x_3$          $x_4$

$y_1$       $a_1$          $a_2$          $a_3$          $a_4$          $n_1$

$y_2$       $a'_1$         $a'_2$         $a'_3$         $a'_4$         $n_2$

            $a_1 + a'_1$   $a_2 + a'_2$   $a_3 + a'_3$   $a_4 + a'_4$   N


If we take a and a′ to represent any associated pair of 
observed frequencies, then we calculate for each pair 

$$\frac{1}{a + a'}\,(a n_2 - a' n_1)^2,$$

where $n_1$ and $n_2$ are the total frequencies in the $y_1$ and $y_2$ 
classes. This expression is evaluated for each pair of associated 
frequencies and the sum of all such expressions is divided by 
$(n_1 \times n_2)$, giving $\chi^2$. Hence 

$$\chi^2 = \frac{1}{n_1 n_2} \sum \left\{ \frac{1}{a + a'}\,(a n_2 - a' n_1)^2 \right\}. \tag{25}$$

The number of degrees of freedom in a 2 × n table is n − 1, so 
that the significance of $\chi^2$ in the above case may be determined 
by consulting Table VII for n = 3. 

Example 29. Determine the significance of the association 
between the variables x and y in the following contingency 
table: 




            $x_1$    $x_2$    $x_3$    $x_4$    Total

$y_1$        36       42       30       12       120

$y_2$        12       22       20       26        80

Total        48       64       50       38       200


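Calculating the expression $\frac{1}{a + a'}(a n_2 - a' n_1)^2$ to the nearest 
whole number for each pair of associated frequencies, we have 

$$\frac{1}{48}(36 \times 80 - 12 \times 120)^2 = \frac{1440^2}{48} = 43{,}200,$$

$$\frac{1}{64}(42 \times 80 - 22 \times 120)^2 = \frac{720^2}{64} = 8{,}100,$$

$$\frac{1}{50}(30 \times 80 - 20 \times 120)^2 = \frac{0^2}{50} = 0,$$

$$\frac{1}{38}(12 \times 80 - 26 \times 120)^2 = \frac{2160^2}{38} = 122{,}779.$$

The sum of these quantities = 174,079. Therefore 

$$\chi^2 = \frac{174{,}079}{120 \times 80} = 18.1.$$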


In Table VII for P = 0.05 and n = 3 we find a value of $\chi^2$ of 
7.815. Our calculated value is much greater than this and we 
may therefore conclude that there is a significant association 
between the two variables. 
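
The short method of formula (25) may be sketched in Python as follows, using the frequencies of Example 29. The function name `chi_square_2xn` is hypothetical; the argument lists are the two rows of the table without their totals.

```python
def chi_square_2xn(first_row, second_row):
    """Chi-square for a 2 x n table by formula (25), without expected frequencies."""
    n1, n2 = sum(first_row), sum(second_row)
    total = sum((a * n2 - a_dash * n1) ** 2 / (a + a_dash)
                for a, a_dash in zip(first_row, second_row))
    return total / (n1 * n2)


chi2 = chi_square_2xn([36, 42, 30, 12], [12, 22, 20, 26])
df = 4 - 1
print(round(chi2, 1), chi2 > 7.815)   # 7.815 is the Table VII value for n = 3
```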


9.v. Goodness of fit. One other use of the $\chi^2$ method may 
be given. Suppose we have an observed frequency distribution 
of a variable and wish to examine the validity of some 
hypothesis about that distribution. We may do this by calculating 
what the distribution would be on that hypothesis and 
examining the agreement, or goodness of fit, of the observed and 
calculated distributions. 

As an example, an observed frequency distribution of 
accidents will be given and an examination made of the hypothesis 
that the accidents are distributed in a Poisson series. If the 
probability of an event occurring is very small but a long 
enough period of observation is taken for it to happen 
sometimes, so that it may happen 0, 1, 2, 3,... times, then the 
distribution of its occurrence will form what is called a Poisson 
series, if a large enough number of independent observations is 
made. The frequency distribution for 0, 1, 2, 3,... occurrences 
will be given by the successive terms of the series 

$$N e^{-m}\left(1,\; m,\; \frac{m^2}{2!},\; \frac{m^3}{3!},\; \ldots\right),$$

where N is the total number of observations and m is the mean 
number of occurrences. 
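
The successive terms of the series may be generated by a short Python sketch. The function name and the values N = 100, m = 2 are hypothetical and serve only to show the shape of the calculation.

```python
import math


def poisson_expected(n_obs, mean, max_count):
    """Expected frequencies N * e^(-m) * m^k / k! for k = 0 .. max_count."""
    common = n_obs * math.exp(-mean)          # the factor shared by every term
    return [common * mean ** k / math.factorial(k)
            for k in range(max_count + 1)]


terms = poisson_expected(100, 2.0, 30)
for k, value in enumerate(terms[:5]):
    print(k, round(value, 2))
```

If enough terms are taken their total approaches N, which gives a convenient check on the arithmetic.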


Example 30. Examine the hypothesis that the following 
frequency distribution of accidents forms a Poisson series. 


Number of      Observed
accidents      frequency

    0              14
    1              37
    2              76
    3              70
    4              64
    5              53
    6              31
    7              19
    8              14
    9               9
   10               5
   11               5
   12               1

                  398

In order to calculate the expected frequency on the 
assumption that the distribution forms a Poisson series we need to 
know m, the mean number of accidents. This is readily 
calculated by the method of formula (2), Section 2.ii. We find that 
$\Sigma(fX)$ = 1549, whence the mean number of accidents is 
1549/398 = 3.892. Substituting this value for m and 398 for 
N in the above formula for the Poisson series, we obtain the 
expected frequencies as below (see later for method of 
calculation): 


Number of      Expected
accidents      frequency

    0             8.12
    1            31.61
    2            61.51
    3            79.80
    4            77.64
    5            60.43
    6            39.20
    7            21.80
    8            10.60
    9             4.59
   10             1.78
   11             0.63
   12             0.20
   13             0.06
   14             0.02

                397.99

In order to satisfy the provision that in calculating $\chi^2$ the 
expected frequencies shall not be less than 5, the last six 
entries are grouped together. Since the expected frequencies 
have been calculated to only 2 places of decimals, the total of 
the expected frequencies is not exactly 398, but it is so very 
little less that it can have no marked effect on the value of $\chi^2$. 
Calling the observed and expected frequencies O and E 
respectively, we now construct columns headed (O − E), (O − E)², 
and (O − E)²/E. The total of the last column is $\chi^2$. All six 
columns are given below. 

We have used ten groups to calculate $\chi^2$ but the number of 
degrees of freedom is only 8, since both N and m were fixed in 
the formation of the Poisson series. For n = 8, we find that 
$\chi^2$ = 15.507 in Table VII, p. 82. The calculated value is much 
greater than this, hence the Poisson series is a bad fit to the 
observed accident distribution. We find, in fact, from 
Elderton's table that for such a value of $\chi^2$, P is less than 0.001, 
and so we may conclude that the observed distribution 
certainly does not form a Poisson series. 

No. of
accidents       O        E       (O − E)     (O − E)²     (O − E)²/E

0              14       8.12       5.88       34.5744        4.26
1              37      31.61       5.39       29.0521        0.92
2              76      61.51      14.49      209.9601        3.41
3              70      79.80      −9.80       96.0400        1.20
4              64      77.64     −13.64      186.0496        2.40
5              53      60.43      −7.43       55.2049        0.91
6              31      39.20      −8.20       67.2400        1.72
7              19      21.80      −2.80        7.8400        0.36
8              14      10.60       3.40       11.5600        1.09
9 and over     20       7.28      12.72      161.7984       22.23

                                   0.01                     38.50


(Note. The total of the (O − E) column should as usual be 
zero. It is not exactly so in our case since the expected 
frequencies were calculated to only two places of decimals.) 
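
The goodness-of-fit test of Example 30 may be sketched in Python as below. The observed frequencies are those of the example (they total 398 and give a sum of products of 1549); the variable names are hypothetical. Because the expected frequencies are carried to full precision instead of two decimal places, the value of $\chi^2$ comes out slightly below the 38.50 of the hand calculation, which does not affect the conclusion.

```python
import math

# Observed frequencies for 0, 1, 2, ..., 12 accidents.
observed = [14, 37, 76, 70, 64, 53, 31, 19, 14, 9, 5, 5, 1]
n = sum(observed)
mean = sum(k * f for k, f in enumerate(observed)) / n


def poisson_term(k):
    """Expected frequency of k accidents in a Poisson series."""
    return n * math.exp(-mean) * mean ** k / math.factorial(k)


# Groups 0 to 8 singly, then "9 and over". As in the text, the expected
# frequency of the last group is the sum of the terms for 9 to 14.
obs_groups = observed[:9] + [sum(observed[9:])]
exp_groups = ([poisson_term(k) for k in range(9)]
              + [sum(poisson_term(k) for k in range(9, 15))])

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_groups, exp_groups))
df = len(obs_groups) - 2      # N and m were both fixed from the data
print(round(mean, 3), round(chi2, 2), df, chi2 > 15.507)
```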


Method of calculating the expected frequencies 

The calculation of the expected frequencies in the above 
example is best done by using logarithms. In the Poisson series 
the term $Ne^{-m}$ is common to each successive frequency, so this 
is worked out first: 

            log 398            = 2.59988, 
            log e              = 0.43429, 
            log log e          = $\bar{1}$.63778, 
            log 3.892          = 0.59018; 
hence       3.892 log e        = antilog ($\bar{1}$.63778 + 0.59018) 
                               = 1.69026; 
therefore   log 398 $e^{-3.892}$   = 2.59988 − 1.69026 
                               = 0.90962, 
whence      398 $e^{-3.892}$       = 8.12. 

The values of the successive terms may then be found by a 
tabular method, making use of the logarithms and logarithms 
of factorials given in Fisher and Yates' tables (Ref. 5, pp. 62, 
76). 

The method is indicated below. p is the index of the power to 
which m is raised in the successive terms. 


                                                       (log $m^p$ − log p!)
 p     log $m^p$     log p!      log $m^p$ − log p!       + log $Ne^{-m}$         antilog

 1     0.59018       0.0         0.59018                1.49980                 31.61
 2     1.18036       0.30103     0.87933                1.78895                 61.51
 3     1.77054       0.77815     0.99239                1.90201                 79.80
etc.   etc.          etc.        etc.                   etc.                    etc.
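
The logarithmic method may be imitated in Python as a check on the tabular work. This is a sketch only: the names are hypothetical, and `math.log10` takes the place of the printed tables of logarithms and of logarithms of factorials.

```python
import math

N, m = 398, 3.892

# log (N e^-m) = log N - m log e, worked out once for all terms.
log_common = math.log10(N) - m * math.log10(math.e)


def expected_by_logs(p):
    """Expected frequency of p occurrences, built up from common logarithms."""
    log_term = p * math.log10(m) - math.log10(math.factorial(p)) + log_common
    return 10 ** log_term                      # the antilogarithm


for p in range(4):
    print(p, round(expected_by_logs(p), 2))
```

Machine logarithms carry more figures than five-place tables, so the values may differ from the tabulated ones by a unit or so in the last place.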


EXERCISES ON CHAPTER IX 

24. (a) Construct contingency tables showing the relationship 
between assessment I in Appendix E and each of the tests from A to G 
inclusive. The tables should be 4 × 3 tables, using the four categories 
of I given and dividing the subjects in each test into three groups as 
nearly equal in size as possible according to their scores. Suitable 
points of division for the groups in each test are as follows: 

A (1) 39 and under: (2) 40-49: (3) 50 and over. 

B (1) 164 and under: (2) 165-194: (3) 195 and over. 

C (1) 19 and under: (2) 20-29: (3) 30 and over. 

D (1) 24 and under: (2) 25-35: (3) 36 and over. 

E (1) 115 and under: (2) 116-136: (3) 137 and over. 

F (1) 24 and under: (2) 25-30: (3) 31 and over. 

G (1) 33 and under: (2) 34-40: (3) 41 and over. 

(b) Calculate $\chi^2$ and the coefficient of contingency for each table 
and determine in which tables there is a significant relationship. 

25. (a) Construct 2 × 2 tables showing the relationship between 
assessment I and each of the tests from A to G inclusive, and also 
between assessment I and the order of merit H. In each case divide 
the assessment I into (1) V.G. and G.: (2) F. and P., and the test scores 
above and below their means as given in the answer to Exercise 3 (b). 

(b) For each table calculate $\chi^2$, using formula (24), and determine 
in which there is a significant relationship. 

26. Investigate the association between test E and assessment I 
by constructing a 2 × 4 table and calculating $\chi^2$ from formula (25). 
Divide test E into two categories according as the subjects are above 
or below the mean for the test, as given in the answer to Exercise 
3 (b), and take assessment I in the four categories as given in 
Appendix E. 

27. In a certain experiment readings had to be made from a scale 
which was graduated in tenths of an inch. A particular observer took 
1000 readings and an analysis was made of the frequency of occurrence 
of the final digits of the observations, those showing the tenth of an 
inch for which each reading was taken. The frequency distribution of 
these digits was as follows: 

Digit      Frequency

  0           151
  1            79
  2            95
  3           109
  4            50
  5           185
  6            67
  7            98
  8           110
  9            56

             1000

There was no reason in the experiment for any particular final 
digits to occur more frequently than any others. Could this observer 
be regarded as reliable in reading the scale? 

(The hypothesis to be examined here is that the observed frequency 
of the digits does not depart significantly from that which would be 
expected if the observer were completely reliable. On this hypothesis 
each digit is equally likely to occur, so that the expected frequency is 
the same for each, namely, 100. Calculate $\chi^2$ and test its significance 
for 9 degrees of freedom.) 



Answers to Exercises 


1. Test      1-25       26-50      51-75      76-100

   A         43.20      44.68      44.52      44.28
   B        175.76     176.72     193.28     193.72
   C         28.76      22.92      23.12      20.64
   D         28.64      33.56      29.96      30.00
   E        125.00     120.80     124.40     130.32
   F         27.48      27.20      26.68      27.72
   G         35.08      38.40      36.16      37.52

2. (a) 27.34.     (b) 27.20. 




3. Test      (a)        (b)        Discrepancy

   A         44.14      44.17      0.03
   B        185.40     184.87      0.53
   C         23.78      23.86      0.08
   D         30.38      30.54      0.16
   E        125.20     125.13      0.07
   F         27.18      27.27      0.09
   G         36.75      36.79      0.04

4. Test      (a)       (b)           Test      (a)       (b)

   A         45.5      44.0          E        123.0     130.0
   B        167.0     185.5          F         28.0      25.5
   C         25.5      22.0          G         37.5      39.0
   D         30.5      27.5



5. Test      (a)        (b)           Test      (a)        (b)

   A         48.62      43.20         E        123.20     135.28
   B        148.52     169.50         F         29.32      22.10
   C         24.82      22.24         G         39.02      43.32
   D         29.30      22.54





             Lower        Upper        Semi-inter-
6. Test      quartile     quartile     quartile range

   A          38.0         50.0          6.0
   B         157.0        203.0         23.0
   C          16.0         31.0          8.0
   D          19.5         40.0         10.25
   E         107.5        142.0         17.25
   F          23.0         32.0          4.5
   G          29.0         42.0          6.5

7. Test      (a)       (b)           Test      (a)       (b)

   A          7.66      9.76         E        17.34     17.32
   B         24.08     29.98         F         4.74      6.52
   C         10.34      8.80         G         7.18      6.88
   D          9.90     12.50

8. (a) 20.57.     (b) 16.05. 


9. Test      (a)       (b)       (c)       (d)

   A         11.06      8.18     13.77     10.81
   C         12.87     11.09     11.93     10.06
   D         12.03     12.33     14.78     15.05
   F          5.99      6.41      5.90      7.35
   G          9.36      8.43      8.17      8.58

10. Test     S.D.      V.            Test     S.D.      V.

    A        11.24     25.46         E        21.13     16.88
    B        37.47     20.21         F         6.46     23.77
    C        12.02     50.55         G         8.70     23.67
    D        13.76     45.29




11. $\beta_1$ = 0.000002; $\beta_2$ = 3.11, 
    $\gamma_1$ = 0.0014; s.e. = 0.245, 
    $\gamma_2$ = 0.11; s.e. = 0.490. 





Hence the distribution does not depart significantly from normality. 


12. $\beta_1$ = 0.0526; $\beta_2$ = 2.44, 
    $\gamma_1$ = 0.23; s.e. = 0.245, 
    $\gamma_2$ = −0.56; s.e. = 0.490. 

Hence the distribution does not depart significantly from normality. 

13. Mean difference = 0.12, t = 0.037. 

Hence the mean difference does not depart significantly from zero. 



14.          Standard                              Standard
             error       Difference                error       Difference

    (a)      1.421       7.39              (d)     1.520       3.20
    (b)      1.827       6.60              (e)     1.628       6.37
    (c)      1.365       3.40              (f)     1.084       9.57


All these differences are significant. 

15. (a) t = 0.62; difference not significant. 
(b) t = 1.70; difference not significant. 

(c) t = 1.40; difference not significant. 

(d) t = 1.29; difference not significant. 

(e) t = 1.94; difference not significant. 
(/) t = d*42; difference significant. 

(g) t = 3.63; difference significant. 

(h) t = 4.25; difference significant. 

16. (a) r = 0.215.     (c) r = 0.088. 
    (b) r = 0.246.     (d) r = 0.266. 



18.      A         B         C         D         E         F         G

A        --       0.372    −0.268    −0.262    −0.075    −0.395    −0.158
B       0.372      --      −0.170    −0.068     0.105    −0.057    −0.076
C      −0.268    −0.170      --       0.151     0.236     0.309     0.151
D      −0.262    −0.068     0.151      --       0.032     0.388     0.105
E      −0.075     0.105     0.236     0.032      --       0.285     0.252
F      −0.395    −0.057     0.309     0.388     0.285      --       0.289
G      −0.158    −0.076     0.151     0.105     0.252     0.289      --

This table illustrates a method of tabulating the coefficients of 
correlation between each pair of a set of tests. For example, the coefficient 
of correlation between test C and test F is given in the row headed C 
under the column headed F, or in the row headed F under the column 
headed C, and is seen to be 0.309. 

19. Correlation between F and A; s.e. = 0.0844, 
         „         „    F and B; s.e. = 0.0997, 
         „         „    F and C; s.e. = 0.0905, 
         „         „    F and D; s.e. = 0.0849, 
         „         „    F and E; s.e. = 0.0919, 
         „         „    F and G; s.e. = 0.0916. 

The coefficient differs significantly from zero in every case except that 
between F and B. 

20. r BE is significantly different from r DF but not from any of the 
others specified. 

21. Tj)F.G = 0*303, 
$r_{CF.D}$ = 0.275, 
$r_{FG.E}$ = 0.234, 
$r_{EF.G}$ = 0.229, 
$r_{CE.B}$ = 0.259. 

22. (a) Between H and G (1-25), p = 0.026, 
                H and D (1-25), p = 0.141. 

(b) Between C and D (1-25), p = 0.171, 
    From Exercise 16 (a), r = 0.215. 

23. (a) For the regression of G on F, z = 0.3387; hence the 
regression is not linear. (Alternatively, the variance ratio is 1.97. That 
in the Fisher and Yates 5 % point table for $n_1$ = 12 and $n_2$ = 60 is 1.92; 
hence the regression departs from linearity.) 

For the regression of F on G, the variance due to deviations from 
linearity is less than that within arrays; hence the regression is linear. 
(b) The two correlation ratios are 0.530 and 0.394. 

24. Between      $\chi^2$       C         P

    A and I       6.821      0.253     0.339
    B and I       3.392      0.181     0.757
    C and I       8.987      0.287     0.174
    D and I      13.835      0.349     0.032        Significant
    E and I      16.432      0.376     0.012        Significant
    F and I      50.085      0.578     0.000001     Significant
    G and I      16.352      0.375     0.012        Significant

25. Between      $\chi^2$       P

    A and I       0.62       0.431
    B and I       0.47       0.493
    C and I       2.52       0.112
    D and I       5.70       0.017         Significant
    E and I      10.38       0.0013        Significant
    F and I      38.30       0.0000001     Significant
    G and I       8.20       0.004         Significant
    H and I       5.77       0.016         Significant


26. $\chi^2$ = 11.58. P = 0.009, hence the association is significant. 

27. $\chi^2$ = 160.02. For this P is exceedingly small. Hence the 
hypothesis fits the observed data very badly and we conclude that the 
observer is unreliable. Note that he has an undue tendency to record 
readings ending in 0 or 5. 

Table of $\sqrt{n_1 n_2 (n_1 + n_2 - 2)/(n_1 + n_2)}$ 

       10     11     12     13     14     15     16     17     18     19
10   9.49   9.98  10.44  10.89  11.33  11.75  12.15  12.55  12.93  13.30
11   9.98  10.49  10.98  11.45  11.90  12.34  12.77  13.18  13.58  13.97
12  10.44  10.98  11.49  11.98  12.45  12.91  13.35  13.78  14.20  14.60
13  10.89  11.45  11.98  12.49  12.98  13.46  13.92  14.36  14.80  15.22
14  11.33  11.90  12.45  12.98  13.49  13.98  14.46  14.92  15.37  15.81
15  11.75  12.34  12.91  13.46  13.98  14.49  14.98  15.46  15.93  16.38
16  12.15  12.77  13.35  13.92  14.46  14.98  15.49  15.98  16.46  16.93
17  12.55  13.18  13.78  14.36  14.92  15.46  15.98  16.49  16.99  17.47
18  12.93  13.58  14.20  14.80  15.37  15.93  16.46  16.99  17.49  17.99
19  13.30  13.97  14.60  15.22  15.81  16.38  16.93  17.47  17.99  18.49
20  13.66  14.35  15.00  15.63  16.23  16.82  17.38  17.93  18.47  18.99
21  14.02  14.72  15.39  16.03  16.65  17.25  17.83  18.39  18.94  19.47
22  14.36  15.08  15.76  16.42  17.06  17.67  18.26  18.84  19.40  19.94
23  14.70  15.43  16.13  16.80  17.45  18.08  18.69  19.27  19.84  20.40
24  15.03  15.78  16.49  17.18  17.84  18.48  19.10  19.70  20.28  20.85
25  15.35  16.12  16.85  17.55  18.22  18.87  19.51  20.12  20.71  21.29
26  15.67  16.45  17.19  17.91  18.60  19.26  19.90  20.53  21.14  21.73
27  15.98  16.77  17.53  18.26  18.96  19.64  20.30  20.93  21.55  22.15
28  16.29  17.09  17.87  18.61  19.32  20.01  20.68  21.33  21.96  22.57
29  16.59  17.41  18.19  18.95  19.68  20.38  21.06  21.72  22.36  22.98
30  16.88  17.72  18.52  19.28  20.02  20.74  21.43  22.10  22.75  23.38
31  17.17  18.02  18.83  19.61  20.36  21.09  21.79  22.47  23.13  23.78
32  17.46  18.32  19.15  19.94  20.70  21.44  22.15  22.84  23.52  24.17
33  17.74  18.61  19.45  20.26  21.03  21.78  22.50  23.21  23.89  24.55
34  18.02  18.90  19.76  20.57  21.37  22.12  22.85  23.57  24.26  24.93
35  18.29  19.19  20.05  20.88  21.68  22.45  23.20  23.92  24.62  25.31
36  18.56  19.47  20.35  21.19  22.00  22.78  23.53  24.27  24.98  25.67
37  18.82  19.75  20.64  21.49  22.31  23.10  23.87  24.61  25.33  26.04
38  19.08  20.02  20.92  21.79  22.62  23.42  24.20  24.95  25.68  26.39
39  19.34  20.29  21.20  22.08  22.92  23.73  24.52  25.28  26.03  26.75
40  19.60  20.56  21.48  22.37  23.22  24.05  24.84  25.62  26.37  27.10
41  19.85  20.82  21.76  22.66  23.52  24.35  25.16  25.94  26.70  27.44
42  20.10  21.08  22.03  22.94  23.81  24.66  25.47  26.26  27.03  27.78
43  20.34  21.34  22.30  23.22  24.10  24.96  25.78  26.58  27.36  28.12
44  20.58  21.60  22.56  23.49  24.39  25.25  26.09  26.90  27.68  28.45
45  20.82  21.85  22.83  23.77  24.67  25.54  26.39  27.21  28.01  28.78
46  21.06  22.10  23.08  24.04  24.95  25.83  26.69  27.52  28.32  29.11
47  21.30  22.34  23.34  24.30  25.23  26.12  26.98  27.82  28.63  29.43
48  21.53  22.59  23.60  24.57  25.50  26.40  27.28  28.12  28.94  29.75
49  21.76  22.83  23.85  24.83  25.77  26.68  27.57  28.42  29.25  30.06
50  21.98  23.06  24.10  25.09  26.04  26.96  27.85  28.72  29.56  30.37

       20     21     22     23     24     25     26     27     28     29
10  13.66  14.02  14.36  14.70  15.03  15.35  15.67  15.98  16.29  16.59
11  14.35  14.72  15.08  15.43  15.78  16.12  16.45  16.77  17.09  17.41
12  15.00  15.39  15.76  16.13  16.49  16.85  17.19  17.53  17.87  18.19
13  15.63  16.03  16.42  16.80  17.18  17.55  17.91  18.26  18.61  18.95
14  16.23  16.65  17.06  17.45  17.84  18.22  18.60  18.96  19.32  19.68
15  16.82  17.25  17.67  18.08  18.48  18.87  19.26  19.64  20.01  20.38
16  17.38  17.83  18.26  18.69  19.10  19.51  19.90  20.30  20.68  21.06
17  17.93  18.39  18.84  19.27  19.70  20.12  20.53  20.93  21.33  21.72
18  18.47  18.94  19.40  19.84  20.28  20.71  21.14  21.55  21.96  22.36
19  18.99  19.47  19.94  20.40  20.85  21.29  21.73  22.15  22.57  22.98
20  19.49  19.99  20.47  20.94  21.41  21.86  22.30  22.74  23.17  23.59
21  19.99  20.49  20.99  21.47  21.95  22.41  22.86  23.31  23.75  24.18
22  20.47  20.99  21.49  21.99  22.47  22.95  23.41  23.87  24.32  24.76
23  20.94  21.47  21.99  22.49  22.99  23.47  23.95  24.42  24.87  25.32
24  21.41  21.95  22.47  22.99  23.49  23.99  24.48  24.95  25.42  25.88
25  21.86  22.41  22.95  23.47  23.99  24.49  24.99  25.48  25.95  26.42
26  22.30  22.86  23.41  23.95  24.48  24.99  25.50  25.99  26.48  26.96
27  22.74  23.31  23.87  24.42  24.95  25.48  25.99  26.50  26.99  27.48
28  23.17  23.75  24.32  24.87  25.42  25.95  26.48  26.99  27.50  27.99
29  23.59  24.18  24.76  25.32  25.88  26.42  26.96  27.48  27.99  28.50
30  24.00  24.60  25.19  25.77  26.33  26.88  27.43  27.96  28.48  28.99
31  24.41  25.02  25.62  26.20  26.78  27.34  27.89  28.43  28.96  29.48
32  24.81  25.43  26.04  26.63  27.21  27.78  28.34  28.89  29.43  29.96
33  25.20  25.83  26.45  27.05  27.64  28.22  28.79  29.35  29.89  30.43
34  25.59  26.23  26.86  27.47  28.07  28.66  29.23  29.80  30.35  30.90
35  25.97  26.62  27.26  27.88  28.49  29.08  29.67  30.24  30.80  31.36
36  26.35  27.01  27.65  28.28  28.90  29.50  30.10  30.68  31.25  31.81
37  26.72  27.39  28.04  28.68  29.31  29.92  30.52  31.11  31.69  32.26
38  27.09  27.77  28.43  29.07  29.71  30.33  30.94  31.53  32.12  32.70
39  27.45  28.14  28.81  29.46  30.10  30.73  31.35  31.95  32.55  33.13
40  27.81  28.50  29.18  29.85  30.50  31.13  31.76  32.37  32.97  33.56
41  28.16  28.87  29.55  30.22  30.88  31.53  32.16  32.78  33.39  33.99
42  28.51  29.22  29.92  30.60  31.26  31.92  32.56  33.18  33.80  34.40
43  28.86  29.58  30.28  30.97  31.64  32.30  32.95  33.58  34.21  34.82
44  29.20  29.93  30.64  31.33  32.01  32.68  33.34  33.98  34.61  35.23
45  29.53  30.27  30.99  31.69  32.38  33.06  33.72  34.37  35.01  35.63
46  29.87  30.61  31.34  32.05  32.75  33.43  34.10  34.76  35.40  36.03
47  30.20  30.95  31.69  32.41  33.11  33.80  34.47  35.14  35.79  36.43
48  30.52  31.29  32.03  32.76  33.47  34.16  34.85  35.52  36.17  36.82
49  30.85  31.62  32.37  33.10  33.82  34.52  35.21  35.89  36.56  37.21
50  31.17  31.94  32.70  33.44  34.17  34.88  35.58  36.26  36.93  37.59

APPENDIX A (cont.)

       30     31     32     33     34     35     36     37     38     39
10  16.88  17.17  17.46  17.74  18.02  18.29  18.56  18.82  19.08  19.34
11  17.72  18.02  18.32  18.61  18.90  19.19  19.47  19.75  20.02  20.29
12  18.52  18.83  19.15  19.45  19.76  20.05  20.35  20.64  20.92  21.20
13  19.28  19.61  19.94  20.26  20.57  20.88  21.19  21.49  21.79  22.08
14  20.02  20.36  20.70  21.03  21.37  21.68  22.00  22.31  22.62  22.92
15  20.74  21.09  21.44  21.78  22.12  22.45  22.78  23.10  23.42  23.73
16  21.43  21.79  22.15  22.50  22.85  23.20  23.53  23.87  24.20  24.52
17  22.10  22.47  22.84  23.21  23.57  23.92  24.27  24.61  24.95  25.28
18  22.75  23.13  23.52  23.89  24.26  24.62  24.98  25.33  25.68  26.03
19  23.38  23.78  24.17  24.55  24.93  25.31  25.67  26.04  26.39  26.75
20  24.00  24.41  24.81  25.20  25.59  25.97  26.35  26.72  27.09  27.45
21  24.60  25.02  25.43  25.83  26.23  26.62  27.01  27.39  27.77  28.14
22  25.19  25.62  26.04  26.45  26.86  27.26  27.65  28.04  28.43  28.81
23  25.77  26.20  26.63  27.05  27.47  27.88  28.28  28.68  29.07  29.46
24  26.33  26.78  27.21  27.64  28.07  28.49  28.90  29.31  29.71  30.10
25  26.88  27.34  27.78  28.22  28.66  29.08  29.50  29.92  30.33  30.73
26  27.43  27.89  28.34  28.79  29.23  29.67  30.10  30.52  30.94  31.35
27  27.96  28.43  28.89  29.35  29.80  30.24  30.68  31.11  31.53  31.95
28  28.48  28.96  29.43  29.89  30.35  30.80  31.25  31.69  32.12  32.55
29  28.99  29.48  29.96  30.43  30.90  31.36  31.81  32.26  32.70  33.13
30  29.50  29.99  30.48  30.96  31.43  31.90  32.36  32.82  33.26  33.71
31  29.99  30.50  30.99  31.48  31.96  32.44  32.90  33.37  33.82  34.27
32  30.48  30.99  31.50  31.99  32.48  32.96  33.44  33.91  34.37  34.83
33  30.96  31.48  31.99  32.50  32.99  33.48  33.96  34.44  34.91  35.37
34  31.43  31.96  32.48  32.99  33.50  33.99  34.48  34.97  35.44  35.91
35  31.90  32.44  32.96  33.48  33.99  34.50  34.99  35.48  35.97  36.44
36  32.36  32.90  33.44  33.96  34.48  34.99  35.50  35.99  36.48  36.97
37  32.82  33.37  33.91  34.44  34.97  35.48  35.99  36.50  36.99  37.48
38  33.26  33.82  34.37  34.91  35.44  35.97  36.48  36.99  37.50  37.99
39  33.71  34.27  34.83  35.37  35.91  36.44  36.97  37.48  37.99  38.50
40  34.14  34.71  35.28  35.83  36.38  36.91  37.44  37.97  38.48  38.99
41  34.57  35.15  35.72  36.28  36.84  37.38  37.92  38.45  38.97  39.48
42  35.00  35.59  36.16  36.73  37.29  37.84  38.38  38.92  39.45  39.97
43  35.42  36.01  36.60  37.17  37.74  38.29  38.84  39.39  39.92  40.45
44  35.84  36.44  37.03  37.61  38.18  38.74  39.30  39.85  40.39  40.92
45  36.25  36.86  37.45  38.04  38.62  39.19  39.75  40.30  40.85  41.39
46  36.66  37.27  37.87  38.47  39.05  39.63  40.19  40.76  41.31  41.85
47  37.06  37.68  38.29  38.89  39.48  40.06  40.64  41.20  41.76  42.31
48  37.46  38.08  38.70  39.31  39.90  40.49  41.07  41.64  42.21  42.77
49  37.85  38.48  39.11  39.72  40.32  40.92  41.50  42.08  42.65  43.22
50  38.24  38.88  39.51  40.13  40.74  41.34  41.93  42.51  43.09  43.66

APPENDIX A (cont.)

       40     41     42     43     44     45     46     47     48     49
10  19.60  19.85  20.10  20.34  20.58  20.82  21.06  21.30  21.53  21.76
11  20.56  20.82  21.08  21.34  21.60  21.85  22.10  22.34  22.59  22.83
12  21.48  21.76  22.03  22.30  22.56  22.83  23.08  23.34  23.60  23.85
13  22.37  22.66  22.94  23.22  23.49  23.77  24.04  24.30  24.57  24.83
14  23.22  23.52  23.81  24.10  24.39  24.67  24.95  25.23  25.50  25.77
15  24.05  24.35  24.66  24.96  25.25  25.54  25.83  26.12  26.40  26.68
16  24.84  25.16  25.47  25.78  26.09  26.39  26.69  26.98  27.28  27.57
17  25.62  25.94  26.26  26.58  26.90  27.21  27.52  27.82  28.12  28.42
18  26.37  26.70  27.03  27.36  27.68  28.01  28.32  28.63  28.94  29.25
19  27.10  27.44  27.78  28.12  28.45  28.78  29.11  29.43  29.75  30.06
20  27.81  28.16  28.51  28.86  29.20  29.53  29.87  30.20  30.52  30.85
21  28.50  28.87  29.22  29.58  29.93  30.27  30.61  30.95  31.29  31.62
22  29.18  29.55  29.92  30.28  30.64  30.99  31.34  31.69  32.03  32.37
23  29.85  30.22  30.60  30.97  31.33  31.69  32.05  32.41  32.76  33.10
24  30.50  30.88  31.26  31.64  32.01  32.38  32.75  33.11  33.47  33.82
25  31.13  31.53  31.92  32.30  32.68  33.06  33.43  33.80  34.16  34.52
26  31.76  32.16  32.56  32.95  33.34  33.72  34.10  34.47  34.85  35.21
27  32.37  32.78  33.18  33.58  33.98  34.37  34.76  35.14  35.52  35.89
28  32.97  33.39  33.80  34.21  34.61  35.01  35.40  35.79  36.17  36.56
29  33.56  33.99  34.40  34.82  35.23  35.63  36.03  36.43  36.82  37.21
30  34.14  34.57  35.00  35.42  35.84  36.25  36.66  37.06  37.46  37.85
31  34.71  35.15  35.59  36.01  36.44  36.86  37.27  37.68  38.08  38.48
32  35.28  35.72  36.16  36.60  37.03  37.45  37.87  38.29  38.70  39.11
33  35.83  36.28  36.73  37.17  37.61  38.04  38.47  38.89  39.31  39.72
34  36.38  36.84  37.29  37.74  38.18  38.62  39.05  39.48  39.90  40.32
35  36.91  37.38  37.84  38.29  38.74  39.19  39.63  40.06  40.49  40.92
36  37.44  37.92  38.38  38.84  39.30  39.75  40.19  40.64  41.07  41.50
37  37.97  38.45  38.92  39.39  39.85  40.30  40.76  41.20  41.64  42.08
38  38.48  38.97  39.45  39.92  40.39  40.85  41.31  41.76  42.21  42.65
39  38.99  39.48  39.97  40.45  40.92  41.39  41.85  42.31  42.77  43.22
40  39.50  39.99  40.48  40.97  41.45  41.92  42.39  42.86  43.32  43.77
41  39.99  40.50  40.99  41.49  41.97  42.45  42.92  43.40  43.86  44.32
42  40.48  40.99  41.50  41.99  42.49  42.97  43.45  43.92  44.40  44.86
43  40.97  41.49  41.99  42.50  42.99  43.49  43.97  44.45  44.93  45.40
44  41.45  41.97  42.49  42.99  43.50  43.99  44.49  44.97  45.45  45.93
45  41.92  42.45  42.97  43.49  43.99  44.50  44.99  45.49  45.97  46.46
46  42.39  42.92  43.45  43.97  44.49  44.99  45.50  45.99  46.49  46.97
47  42.86  43.40  43.92  44.45  44.97  45.49  45.99  46.50  46.99  47.49
48  43.32  43.86  44.40  44.93  45.45  45.97  46.49  46.99  47.50  47.99
49  43.77  44.32  44.86  45.40  45.93  46.46  46.97  47.49  47.99  48.50
50  44.22  44.78  45.32  45.87  46.40  46.93  47.46  47.97  48.49  48.99
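
Any entry of the table in Appendix A can be recomputed directly from the quantity named in its title, which is useful for sample sizes outside the range 10 to 50. The following is a minimal Python sketch; the function name `appendix_a_entry` is illustrative only and is not part of the original text.

```python
import math


def appendix_a_entry(n1: int, n2: int) -> float:
    """Return sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2)),
    the quantity tabulated in Appendix A for sample sizes n1 and n2."""
    if n1 < 1 or n2 < 1:
        raise ValueError("sample sizes must be positive")
    return math.sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))


# The expression is symmetric in n1 and n2, so either sample may be taken first.
print(f"{appendix_a_entry(10, 10):.2f}")  # 9.49
print(f"{appendix_a_entry(20, 30):.2f}")  # 24.00
```

Since the expression is symmetric, the row and column of the table may be interchanged without altering the value read off.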



APPENDIX B 




Curve showing values of N necessary for
different values of r to be significant
[Figure: curve of N plotted against r]



If a calculated r from a sample of N observations is larger than the
value given by the curve corresponding to that value of N, it is
significant.


Conversely, if the number of observations, N, in a sample yielding
a particular value of r is larger than that given by the curve
corresponding to that value of r, then r is significant.

In border-line cases the standard error of r should always be
calculated as a check.
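
As a further numerical check on a reading from the curve, r may be converted to a value of t and referred to a table of t with N - 2 degrees of freedom. This conversion is a standard alternative and is assumed here for illustration only; it is not the standard-error calculation of the text, and the function name `t_from_r` is hypothetical.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """Convert a correlation coefficient r from n pairs of observations
    into t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("at least 3 pairs of observations are needed")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Example: r = 0.5 from N = 27 pairs gives t on 25 degrees of freedom.
print(f"{t_from_r(0.5, 27):.3f}")  # 2.887
```

The value of t so obtained is then compared with the tabled value for the chosen level of significance.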




APPENDIX C 

Table of values of $6/[N(N^2-1)]$ for different values of N

The required values are the reciprocals of the numbers below. 


 N             N             N
10     165    30    4495    50   20825
11     220    31    4960    51   22100
12     286    32    5456    52   23426
13     364    33    5984    53   24804
14     455    34    6545    54   26235
15     560    35    7140    55   27720
16     680    36    7770    56   29260
17     816    37    8436    57   30856
18     969    38    9139    58   32509
19    1140    39    9880    59   34220
20    1330    40   10660    60   35990
21    1540    41   11480    61   37820
22    1771    42   12341    62   39711
23    2024    43   13244    63   41664
24    2300    44   14190    64   43680
25    2600    45   15180    65   45760
26    2925    46   16215    66   47905
27    3276    47   17296    67   50116
28    3654    48   18424    68   52394
29    4060    49   19600    69   54740


To obtain $\rho$ by the use of the above table, divide $\Sigma(d^2)$ by the
number in the table opposite the appropriate value of N and subtract
the answer from 1.
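
The rule above can be sketched in a few lines of Python. The numbers in the table are the values of N(N^2 - 1)/6, so the rule amounts to taking 1 - 6Σ(d²)/[N(N² - 1)]. The function names below are illustrative and do not come from the text.

```python
def appendix_c_number(n: int) -> int:
    """Return N(N^2 - 1)/6, the number printed in Appendix C opposite N.

    (N - 1) * N * (N + 1) is a product of three consecutive integers and is
    therefore always divisible by 6, so integer division is exact."""
    return n * (n * n - 1) // 6


def rank_correlation(sum_d2: float, n: int) -> float:
    """Divide the sum of squared rank differences by the tabled number
    and subtract the answer from 1."""
    return 1 - sum_d2 / appendix_c_number(n)


print(appendix_c_number(10))               # 165
print(f"{rank_correlation(33, 10):.2f}")   # 0.80
```

For values of N beyond 69 the divisor may be computed in the same way instead of being read from the table.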
APPENDIX D

Squares of numbers from 1 to 99 

No.  Square    No.  Square    No.  Square    No.  Square
  1       1     26     676     51    2601     76    5776
  2       4     27     729     52    2704     77    5929
  3       9     28     784     53    2809     78    6084
  4      16     29     841     54    2916     79    6241
  5      25     30     900     55    3025     80    6400
  6      36     31     961     56    3136     81    6561
  7      49     32    1024     57    3249     82    6724
  8      64     33    1089     58    3364     83    6889
  9      81     34    1156     59    3481     84    7056
 10     100     35    1225     60    3600     85    7225
 11     121     36    1296     61    3721     86    7396
 12     144     37    1369     62    3844     87    7569
 13     169     38    1444     63    3969     88    7744
 14     196     39    1521     64    4096     89    7921
 15     225     40    1600     65    4225     90    8100
 16     256     41    1681     66    4356     91    8281
 17     289     42    1764     67    4489     92    8464
 18     324     43    1849     68    4624     93    8649
 19     361     44    1936     69    4761     94    8836
 20     400     45    2025     70    4900     95    9025
 21     441     46    2116     71    5041     96    9216
 22     484     47    2209     72    5184     97    9409
 23     529     48    2304     73    5329     98    9604
 24     576     49    2401     74    5476     99    9801
 25     625     50    2500     75    5625




APPENDIX E 


Numerical material for exercises 

The following two pages of figures give the scores of 100 subjects
in certain psychological tests. The scores in seven such tests are given
in columns A to G inclusive. Column H gives the ranked order of
merit of the subjects in a scholastic examination, and column I gives
an assessment of the practical ability of the subjects in some
handicraft. This assessment is in four categories: Very Good (V.G.), Good
(G.), Fair (F.) and Poor (P.).

These two pages of scores provide the basis for most of the exercises 
at the end of the successive chapters of the text. The student who 
wishes for additional exercises may easily make up others for himself 
from the figures given. 
Subject     A     B     C     D     E     F     G     H  I
      1    41   164    31    45   121    31    29    50  V.G.
      2    49   168    13    30   145    31    35    12  G.
      3    41   132    20     5   108    25    25   73*  F.
      4    35   165    31    25   141    30    40   28*  ?
      5    35   152    30    27    96    18    28    92  P.
      6    ?5   156    45    33   154    35    49     4  V.G.
      7    5?   153    38    41   102    27     ?    94  F.
      8    38   165    42    62   156    31    26    11  G.
      9    49   161    30    24   147    28    39    30  F.
     10    46   156    30    37   124    27    68    52  F.
     11    28   167    30    23    98    19    33    93  F.
     12    64   204    46    15    92    19    18    96  P.
     13    48   203    36    35   170    36    41     1  G.
     14    31  139*     2    38   126    29    35    52  G.
     15    45   191    41    19   127    28    38    52  F.
     16    49   156    44    11   131    22    33    35  F.
     17    66   240    18    20   111    17    36    60  P.
     18    46   146    26    40   105    26    21    75  P.
     19    21   138    44    31   122    39    27    54  V.G.
     20    27   180    30    30   100    33    60    84  G.
     21    50   277    19    27   152    34    39    18  G.
     22    49   174     5    16   125    25    27    55  F.
     23    26   176    44    38   116    34    40   73*  G.
     24    39   157    15    29   125    19    45    56  G.
     25    52   274     9    15   131    24    25    36  P.
     26    48   167    13    10    94    22    47    95  F.
     27    44   152     7    42    80    30    30   100  V.G.
     28    50   187    25    36   109    24    41    87  F.
     29    49   193    12    32   114    31    23    70  P.
     30    45   196    30    40   158    26    38    7*  G.
     31    61   184     6    10    94    25    26    97  F.
     32    49   222    22    28   137    27    40    31  G.
     33   38*   196    24    53   127    36    34   57*  V.G.
     34    41   188    31    35   137    30    37    37  G.
     35    50   234    15    28   107    36    45    72  V.G.
     36    52   201     6    43   121    30    34   57*  F.
     37    32   143    39    37   142    32    51    33  G.
     38    35   177    32    35   141    28    40    33  F.
     39    44   156    21    30   115    19    39    76  P.
     40    61   245    20    49   121    29    42    59  V.G.
     41   39*   163    42    43   106    27    27    77  F.
     42    52   181    31    43   109    29    45    71  V.G.
     43    47   150    19    20   129    27    55   38*  G.

44 

29 

146 

14 

59 

133 

35 